Making Data Simple: Data movement at size and scale
Michael Springgay, IBM STSM, Db2 Data Warehouse Development, and Rajani Maindiratta, IBM Senior Manager, Db2 Data Warehouse on Cloud Development for Load, share their experiences moving data for customers big and small. What are the options for data movement and what is the impact of cloud?
07.30 Learn more about Aspera
16.30 Learn the difference between TCP and UDP
22.50 Learn more about the Amazon Snowball
25.45 Find The Harry Bosch Novels by Michael Connelly
26.00 Find An Autobiography: The Story of My Experiments with Truth by Mahatma Gandhi
26.30 Find The Coaching Habit: Say Less, Ask More & Change the Way You Lead Forever by Michael Bungay Stanier
Hungry for more? Check out our previous podcast episodes of Making Data Simple:
- Episode 1: Making Data Simple: The big data problem
- Episode 2: Making Data Simple: End of tech companies
- Episode 3: Making Data Simple: A new definition of client care
- Episode 4: Making Data Simple: Will machines take our jobs?
- Episode 5: Making Data Simple: Growth Hacking - Not just for start ups
- Episode 6: Making Data Simple: From 2D to 3D -- Augmented reality data visualization
- Episode 7: Making Data Simple: The 5 areas businesses MUST get right
- Episode 8: Making Data Simple: How data science is helping to improve aviation
- Episode 9: Making Data Simple: Making data fun & easy with Caleb Curry
- Episode 11: Making Data Simple: Cloud computing, part 1
Al Martin: Hi, guys. This is Al Martin with the Making Data Simple series. Today we're going to talk about data movement. And I've got two expert speakers in the room with me. I've got Michael and Rajani. And why don't I just start, let you guys introduce yourselves, and then I've got plenty of questions around data movement. Go ahead, Michael.
Michael Springgay: Okay, so, I'm Mike Springgay. I'm an architect within the Db2 warehouse development organization focused on data movement and compatibility.
Rajani Maindiratta: Hi, I am Rajani Maindiratta. I manage the load team in Db2 Warehouse on Cloud, now moving to other data areas as well.
Al Martin: Fantastic. So, look, I deal with all of it, you know, unified governance, hybrid data management, visualization, in our entire point of view, our entire strategy. I think the difficulty we have today is semi-structured, structured, unstructured data, Skypes, on-premise, cloud, transactional workloads, and I could go on. We're on cloud. We're on hosted. We're on private cloud. We have an appliance.
None of this works without data movement. That's how important data movement is. If you can't move data, you've got none of that, right? So I guess the first question, Rajani, and I'll go with you first if that's all right: what is data movement in your definition, and why is it important?
Rajani Maindiratta: Okay, so kind of what you explained earlier, I see it as just one of the challenges around data in general. As you said, organizations are faced with a deluge of data, and it's something they have to manage. They need to understand it, and they would like to mine it to make business decisions.
And there's three or four different difficulties with that. One is the volume, the sheer volume of data that's coming in from different aspects: day-to-day transactions, e-commerce transactions, IoT, social, mobile data. All of that's coming into the organization. So they need to understand what's coming in.
They also need to deal with the speed at which it's coming in, and it's also coming from a number of different sources. So because organizations would like to make business decisions, to potentially increase revenue, improve efficiencies in the organization, perhaps solve a business problem, they need to have a repository where this data, coming from multiple sources, resides, where they can do analytics, run the reports, and understand what's happening in their organization so they can make those decisions.
Data movement comes into play because they need to be able to get the data into the repository, the data warehouse, which is simply a repository for different sources of data. They need to be able to move the data efficiently so that they can quickly react to what the data is telling them. You know, maybe they need to focus on an area because they're somewhere else. The possibilities are really endless.
So it's about being able to quickly access the data. And, you know, the cloud brings a whole other area of challenges around bandwidth, and it's terabytes, petabytes of data. How do you move that into a data center that's, you know, physically thousands of miles away? So those are some of the challenges. That's why it's important: data is something you really need to mine in order to gain benefits, whether it's within the organization, whether it's governments, whether, you know...
Al Martin: What are the most common customer stories or use cases you're working with today? And I guess or if there's a unique one I'd be curious as to that as well, I mean one of the strangest ones you have. But what is a common customer story?
Rajani Maindiratta: Customer use cases, there's been a few that we've encountered. Healthcare has been one, where we have customers who, you know, need to mine physician records, test results, things like that on a daily basis, right? They're constantly analyzing the data to know what's happening from a healthcare perspective.
There's also marketing trends that need to be mined from the retail perspective. One of our customers, in one of the products they're working on, analyzes trends on where advertising works, for example. That's another typical use case. Healthcare...
Al Martin: And what are they moving data from into where?
Rajani Maindiratta: Oh, okay. So we see different sources, right? Oracle is one. The customer is moving data from Oracle to DashDB or Db2 Warehouse on Cloud, for example.
Al Martin: Is that Oracle behind the firewall, I mean Oracle proper? In other words…
Rajani Maindiratta: Yes on-prem into warehouse on cloud.
Al Martin: So did...
Rajani Maindiratta: So yes, that's one use case. There's the Netezza use case, right, the appliance use case of course.
Al Martin: That's the appliance they use...
Rajani Maindiratta: It's the appliance use case?
Al Martin: The appliance use case one in the cloud as well?
Rajani Maindiratta: That's right yes.
Al Martin: Yes.
Rajani Maindiratta: That's happening as well.
Al Martin: So, Mike, where are you spending most of your time at today? I mean I know you've got a lot of passion. We talk a lot about data movement. You know, kind of carrying on the conversation we just had, you know, what do you see as kind of the biggest challenge? I'm kind of — well, I got a multi-part question here.
What do you see as the biggest challenge that you're facing? Are you doing more on-prem to cloud, or on-prem to on-prem, in terms of data movement? How do those challenges differ, what kind of speeds, and what are the sizes of the data that many of these clients are moving?
Michael Springgay: Right. So I think your last point is really where the biggest challenge is: the size of the data has, you know, been expanding tremendously in the last few years. And that certainly causes challenges on premise, but it also significantly challenges you when you talk about moving into the cloud. So a lot of the area of focus is really in both spaces.
Most customers, most companies, are in a hybrid situation. And they really aren't moving everything off of on-premise. They still have on-premise for their core applications, but they want to move some lower priority workloads into the cloud. But they need to get the data there, and the data has to get there quickly. And that's a challenge for most cloud providers.
And in the industry most people are looking at disk shipping type technology, with large appliance-based products for shipping data. IBM is no different in that space. You know, for really large volumes of data, and I mean over 25 terabytes and into petabytes, you want to ship the disk. But the challenge with that is you still have the latency because you, you know, unload it, you've got to Purolator it or ship it with, you know, FedEx or whatever to get the data into the data center. There's always a delay.
So one of the interesting things that, you know, IBM is doing is that we bought a company called Aspera, and we're using that Aspera technology to allow us to transfer larger volumes of data over the Internet, so we can get up into that 20, 25 terabyte range and still move the data reasonably quickly from on-premise into cloud.
Al Martin: When they're going from on-premise into the cloud, though, where is the line of distinction between "hey, I can move that data with a great technology like Aspera," and what kind of speed are you talking, by the way, and "you've got to go get FedEx or something to ship the data"? What's the line of distinction there?
Michael Springgay: So usually we recommend around that 20 to 25 terabyte range. I mean, obviously it's going to be somewhat dependent on the customer's bandwidth and their own willingness to use that bandwidth to move the data, or even the time they're willing to wait. But after 25 terabytes you're probably in the breaking zone, where for the amount of time it would take it's probably faster to put it onto an appliance, ship the appliance, and get it unloaded into the data center.
The speed we're seeing is, you know, typically with Aspera we can get up to transfer rates of 200 gigabytes an hour, assuming we can saturate the whole network. Without Aspera, though, you know, most customers would probably see 5 or 20 gigabytes an hour. So it's really not that usable without that technology to move data in the 20 to 25 terabyte range, because you're talking days.
And when you're talking about the Internet, it's also not necessarily the most robust connection, so, you know, you have to assume some amount of drop. So if you're moving a terabyte file and you lose your connection, then you better start over. So it's being able to get that through with the resiliency of Aspera, and the speeds, that makes it viable to move larger data.
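Editor's sketch: the trade-off Michael describes comes down to simple arithmetic. The Python below is only an illustration. It assumes a perfectly sustained rate and decimal units (1 TB = 1000 GB), and the function name `transfer_hours` is hypothetical, not part of any IBM tool. The rates are the ones quoted in the conversation.

```python
def transfer_hours(data_tb: float, rate_gb_per_hour: float) -> float:
    """Hours needed to move data_tb terabytes at a sustained rate (1 TB = 1000 GB)."""
    return data_tb * 1000 / rate_gb_per_hour


# Rates quoted in the conversation; real throughput depends on the customer's network.
scenarios = [
    ("with acceleration, 200 GB/h", 200),
    ("without, 20 GB/h", 20),
    ("without, 5 GB/h", 5),
]

for label, rate in scenarios:
    hours = transfer_hours(25, rate)
    print(f"25 TB {label}: {hours:.0f} h (about {hours / 24:.1f} days)")
```

At 200 GB/h, 25 TB works out to 125 hours, a little over five days, which is roughly where shipping an appliance starts to compete.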
Al Martin: So that's good, thank you. But Rajani, where are the inhibitors that you face most? I mean, what's the biggest challenge? In other words, is it the network? Do we find ourselves in a situation where, even if we could use Aspera, for example, we end up, you know, using FedEx or something because of a firewall issue, and they don't want to open it up for, what, a couple of days in terms of driving this? Is that the biggest issue, or do you see the biggest issue as some other form of technology?
Rajani Maindiratta: I see bandwidth as a big issue that comes up many times. And that's when customers do look to disk shipping, as we talked about. Another challenge in that area is just ease of use, right, making it simple for customers to be able to do this data movement. And that's where other technologies like Lift would come into play to help solve that, and external tables too, something that...
Al Martin: Well, how does Lift solve that? I know I'm always pushing for, like, a one-button experience. Where are we at?
Rajani Maindiratta: In our experience over the last couple of years, we found that customers like the CLI experience, so the CLI was brought in so that the customer can integrate that data movement process into their own scripts and all. So the CLI has made that experience...
Michael Springgay: Yes, especially in the Netezza space where we're able to basically instrument through scripting with all the extract...
Rajani Maindiratta: Yes.
Michael Springgay: ...to load into the Db2 warehouse. So it really makes it an end-to-end process, maybe not exactly one click, but very seamless options the customer has.
Al Martin: Well, how close are we to one click?
Michael Springgay: I know that that's easy for an executive to come in and say, "We need pushbutton data movement."
Al Martin: Right, but what's real? What's fact? What's fiction?
Michael Springgay: Well, I mean, I think with data movement there is challenges with simple one click because you have to understand the source and the target of your extract. So there is some knowledge of shaping the data that you have that has to be presented.
But I think we're at the point where really you do just provide the shape of the data. So that's knowledge you have typically. And then really is you push go and it does the extract and moves it up.
Al Martin: So, in short, there is pushbutton, but you've got to know the source, the destination and you've got to set things up to do it correctly. Are you doing a lot of ground to cloud right now?
Michael Springgay: Yes. There is still a lot of customers looking to move to the cloud. And all of them are using — most of them are using at least the Lift technology to drive that.
Al Martin: And when they do, I mean, do you get more requests that are just ground to ground, or do you get a lot of clients that are coming in because of, you know, the difficulty in ground to cloud and needing a technology like Aspera? Is that where most of the urgent requests or the main queries come in from clients?
Michael Springgay: I think because it is more challenging, it is where we've seen our focus go, to help them deal with it. Certainly on ground to ground the customers are asking, but there are more traditional tools that people have been using, and that hasn't really changed all that much. So there they're really just looking at speed, which is important because even in ground to ground the distance between data centers is growing.
It used to be that people were moving data probably from box A to box B and they sat beside each other. I mean now people are moving data across data centers that are probably on the other side of the country. So even there you see the challenges of network coming into play.
Al Martin: So, Michael, the one question I do have is that IBM typically takes a point of view where we're bringing analytics to the data, and we're moving this data around, which is fine. But I know it seems like a lot of different vendors or companies, you know, they charge you like nil to nothing to move the data into their cloud. But once it's there, if you want to move it around, if you want to get it out of the cloud, that's where they start making their money. Which is different than IBM, in that we really want to put analytics on top, and essentially that's where our value prop is: analytics on top of said data, not moving the data in and then kind of growing access to said data. Do you see it the same way, or do I got that right or...
Michael Springgay: Yes. I mean, I think IBM's view is that we're trying to bring the value and charge you for the value you can get out of your data, not for the movement of data. I would certainly see with competitors that they maybe make it easy to get it in, but then they charge you either to move it between services, which, you know, is sort of like double charging because you had already got it up there and they're charging for that, as well as to get it out. They don't really want you to move the data back out.
Al Martin: Now and there I guess the other question I had that kind of hits me, you know, is it would seem to me like we've got a campaign what's best (about the) work. It's a revolution to move data from on-prem to the cloud right and get it there. But having said that.
You know, there's a revolution of moving data to the cloud, but I don't know what your experience is. To me it's a difficult concept. Even though clients are actually moving to the cloud, they've got to open up their firewall, or whatever the case may be, to get that data from on-prem to the cloud. Are they willing to do that? Do you see that, or do you see a lot of pushback, or do you see "all right, as a result we're going to throw it into FedEx because we're uncomfortable with that"?
Michael Springgay: Yes I think there are certainly challenges and typically there's internal struggles between the different groups and the (unintelligible) you have to put firewalls open. I find that most cases that can happen. The customers understand. They go to cloud they have to open up some of the network.
But there definitely is a challenge about how much of the bandwidth of the Internet connection they pay for that they're willing to use to move data. So that often is a challenge and then, you know, you fall back to FedEx because you can't use all the bandwidth.
Al Martin: How often does that fall back happen?
Michael Springgay: You know, not all the time. It really depends on how much data we're moving. So I do think, you know, we can probably move more data with Aspera than maybe a customer is willing to take. And so they'll fall back to FedEx earlier than maybe we would say that you need to.
Al Martin: Rajani does that match your customer case points that I know you work on a lot of customer...
Rajani Maindiratta: So Aspera, we've shipped it with our product for almost two years now. So it's been IBM's only technology in (unintelligible) cases.
Michael Springgay: You know what we did? I think IBM acquired Aspera like a year when nobody knows it right, remember?
Al Martin: Couple years.
Michael Springgay: Couple years at least.
Rajani Maindiratta: Yes and then we integrated it into data on cloud.
Al Martin: So it's kind of the backbone of everything we do at this point in time.
Rajani Maindiratta: It is yes.
Al Martin: All right, what separates that from just any other technology? Why would we do that?
Michael Springgay: Well I think, you know, it's the commercial version of some technology that's been out there in research. But they've really made it simple, you know, because it's all there. It's set up for you as part of the product. You just have to install a client. You don't have to (unintelligible) a server, and we're managing the resiliency of it, which is some of the other (unintelligible).
Al Martin: But the speed, you know, as I'm working with a team, speed is amazing. So (unintelligible) on difference than just the normal data movements. I mean there's some magic in there some place.
Michael Springgay: Yes, so they're using UDP versus TCP. So they're building on top of the network. They're not relying on sort of your operating system for the failure. So they have their own algorithms to figure out how to split the file up and send it in parallel and put it back together and make sure the pieces all come correctly.
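Editor's sketch: the general idea of splitting a file, sending the pieces in parallel, and checking them on arrival can be illustrated in a few lines of Python. This is not Aspera's protocol or algorithm. It is a toy, in-memory simulation under our own assumptions: `send_chunk` is a hypothetical stand-in for a network send, the chunk size is tiny for readability, and a SHA-256 digest per chunk stands in for whatever integrity check a real transfer tool uses.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 4  # bytes; deliberately tiny so the example is easy to follow


def split(data: bytes, size: int = CHUNK_SIZE):
    """Cut the payload into (index, chunk) pairs."""
    return [(i // size, data[i:i + size]) for i in range(0, len(data), size)]


def send_chunk(chunk):
    """Hypothetical stand-in for a network send: returns (index, payload, digest)."""
    index, payload = chunk
    return index, payload, hashlib.sha256(payload).hexdigest()


def reassemble(received):
    """Verify every chunk, then join the chunks in index order."""
    parts = {}
    for index, payload, digest in received:
        if hashlib.sha256(payload).hexdigest() != digest:
            raise ValueError(f"chunk {index} is corrupted and would need to be resent")
        parts[index] = payload
    return b"".join(parts[i] for i in sorted(parts))


def transfer(data: bytes, workers: int = 4) -> bytes:
    """Send all chunks in parallel and rebuild the payload on the other side."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        received = list(pool.map(send_chunk, split(data)))
    received.reverse()  # simulate chunks arriving out of order
    return reassemble(received)


print(transfer(b"data movement at size and scale"))
```

Because each chunk carries its own index and digest, a lost or damaged piece can be resent on its own instead of restarting the whole file, which is the property that matters on an unreliable connection.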
Al Martin: So this is like Silicon Valley. And I know you've probably never seen that show.
Michael Springgay: I...
Al Martin: It's the same thing.
Michael Springgay: Yes it's the — yes. They are basically applying heuristics at a higher level to basically exceed the standard bandwidth limitations.
Al Martin: A lot of what we've talked about, I think some of it's just an overall point of view, but some of it's IBM technology. Is there any more you can say about the industry in this space? I mean, where the industry is, where you think it's going, you know, where Amazon is, where Google is? Any comments on any of those areas?
Michael Springgay: Amazon for sure is certainly marketing the whole disk transfer service. Like they have names and everything and they definitely grow in the size of which appliances they have. They have appliances for small, medium and large whereas, you know, we're trying to at IBM at least not have to have a smaller version available through the Aspera technology.
Google is I think in the same area but they're probably not in (cure) in the disk transfer. They do have disk transfer services but it's not quite as robust as say AWS for sure. I think everyone's trying to figure out the network bandwidth solution. I know AWS recommends some open source technology that is similar but not quite as sophisticated as the Aspera technology. But they haven't really necessarily invested directly in how to make it easy for the customer, sort of that one-button-click thing. They're telling you how to basically build up different networks and servers to solve the problem.
Al Martin: Where are we going in the future with this? I mean, is it just faster speed, speed, speed? Is that it, and the one-click button, or is there going to be something that you see in the future that's really going to transform the industry yet again around data movement? It's kind of like, there's an HBO series called Silicon Valley, and they're really focused on compression, which, at least the way I understand it, enables, you know, expedited transfers. It's a show, right? But with that compression they differentiate themselves in terms of their ability to transfer at speeds that nobody else can do. Now where do you think we're going either...
Rajani Maindiratta: (Unintelligible) commented (unintelligible). He didn't — he doesn't get overall that they're moving into the warehouse, updates of the (unintelligible). That's a huge focus.
Al Martin: Yes, but how do you get that performance, through compression through...
Rajani Maindiratta: Compression is a (unintelligible) yes, just more efficient, effective use of parallelism as well is something — an area to focus on as well (unintelligible).
Michael Springgay: I think the other thing is that people are less interested, once they get the data in, in moving it back out. So, you know, you would see an ETL process in the past, where the data quantities weren't that big, so people would bring it out into a middle-tier tool to do transformations and put it back in. Now you really see customers pushing hard to get all of those transformations done as much as possible within the database server so that the data movement piece is minimized.
So one other aspect is trying to avoid it (unintelligible) because you can keep the data in as many locations as they are and only move the smaller pieces that you need to make business logic grow.
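Editor's sketch: the shift Michael describes, doing transformations inside the database rather than pulling rows out to a middle tier, can be illustrated with Python's built-in `sqlite3` module. SQLite stands in here for a warehouse only for the sake of a runnable example, and the table and column names (`raw_sales`, `sales_by_region`, `amount_cents`) are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_sales (region TEXT, amount_cents INTEGER)")
conn.executemany(
    "INSERT INTO raw_sales VALUES (?, ?)",
    [("east", 1250), ("east", 750), ("west", 3000)],
)

# The transformation runs inside the database engine: no raw rows are
# extracted to a separate tool and loaded back in.
conn.execute(
    """
    CREATE TABLE sales_by_region AS
    SELECT region, SUM(amount_cents) / 100.0 AS total_dollars
    FROM raw_sales
    GROUP BY region
    """
)

# Only the small, already-aggregated result leaves the database.
summary = conn.execute(
    "SELECT region, total_dollars FROM sales_by_region ORDER BY region"
).fetchall()
print(summary)
```

The only data that crosses the wire is the summary, which is the "move the smaller pieces" point made above.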
Al Martin: All right. So some of what we've been talking about is, I think, you know, again an overall point of view. Anything that IBM's doing differently that you guys would call out, in terms of whether it's performance, ease of use, real-time data access? Any of those, compression...
Michael Springgay: I think it's really around ease of use and speed. I mean really with the embedded Aspera in our cloud technology we're trying to make that simple versus, you know, I mention Amazon who sort of points you in places where you can speed up your networks but you have to set up all the pieces. So I think we're trying to make it simple.
Al Martin: Simple. Any other questions or any other answer that you'd give Rajani?
Rajani Maindiratta: Yes simple and just understanding our customer as well too I think is something we're going to have to (unintelligible) in the sense of we take feedback from our customers and drive improvements within the product.
Al Martin: If I'm a client what kind of speed can I — I mean I know it's dependent on compression, data size or lots of different things. What kind of speed? You may have hit this before but what can I expect?
Michael Springgay: So I think, you know, around the cloud we're looking at about the 200 gigabytes an hour range with the Lift technology and Aspera. On-premise, though, we can see, you know, 4 or 5 terabytes an hour if we have enough network storage attached to the machine with the (unintelligible).
Al Martin: Good. So if the listeners want to learn more about, you know, data movement, our point of view, IBM's point of view on data movement where would they go?
Rajani Maindiratta: We have the Knowledge Center, which I believe has a lot of information there. Some of our recent announcements are around the Mass Data Migration device and the (list field). I think there's an announcement. If you go look at the IBM announcements page, there's a blog around those technologies that we released.
The Knowledge Center I think also has information about external table support. That was one of our (unintelligible) from the performance around data load. So not (unintelligible) your blog (unintelligible) the mix as well as I think those are the main...
Al Martin: Good. Well any ones that she might have missed Michael?
Michael Springgay: Well I would just say for ground cloud, you know, the Lift Landing Zone out there on...
Rajani Maindiratta: Sure.
Michael Springgay: ...IBM Cloud is probably the best place to go first. That's where we want you to go and that's going to make it the simplest one. But or closest to one click that we (unintelligible).
Al Martin: We'll also get those in the show notes. So look, anything else that you'd say? I like to do a lightning round just for fun. It's like personal questions...
Rajani Maindiratta: Sure.
Al Martin: ...because I get to know people. And also I'm going to ask what book you're reading or what you would recommend because I read a lot of books and so I just know that I'm going to go there. But anything left on the data movement that you'd say that you feel like we missed or left unsaid or did we get it?
Rajani Maindiratta: I have a question. What was everyone else doing at IBM (unintelligible) that we may want to be doing then?
Michael Springgay: That's a good question.
Al Martin: Okay.
Michael Springgay: Not sure there is anything that we're not doing. You know, everyone's sort of doing that same thing. We never (unintelligible). I guess the real thing...
Michael Springgay: ...IBM is focused on that bigger device, the one that we may need is smaller things like the Snowball.
Al Martin: Snowball, explain that.
Michael Springgay: So Amazon Snowball is a 50 terabyte device. It's pretty neat because it's a small box that you can (unintelligible) versus, you know, ours, which is a fairly large appliance. So I think just from the ease of use, for those smaller customers that aren't in that huge terabyte, 100 terabytes, petabyte system, it's slightly more convenient.
Al Martin: Yes but well what's the customer profile we're typically working with? Are they in that smaller domain or are they back into the domains that we're...
Michael Springgay: I think the challenge is that the customers we're working with will ultimately need (masked in) movement. But when you're talking about a POC or something like that, your starting area is a much smaller dataset that they're trying to move over. And so it just maybe gives us more flexibility in the early stages.
Al Martin: All right, cool. So look, I like to do a little lightning round, and Rajani, I'll start with you.
Rajani Maindiratta: Sure.
Al Martin: What's the most exciting thing you're working on right now? Anything that you'd call out that'll get you up every morning?
Rajani Maindiratta: I think it's around this area (unintelligible).
Al Martin: Yes, but give me some specific in terms of around this area the data movement.
Rajani Maindiratta: Update performance. That’s kind of the area I'm focused on.
Al Martin: Is there something that makes update performance more challenging than insert performance, delete performance...
Rajani Maindiratta: It's very similar kind of development space. It's just that I think one of our benchmarks has been against the appliance. And we — we're trying to get our speeds to kind of up — on par or better than that. So there's challenges just like as in the load area but I wouldn't say it's worse - any worse or better.
Al Martin: Fair enough, fair enough. How about you, Michael? What gets you up in the morning?
Michael Springgay: Yes, so, you know, my day job is to focus on both external tables and compatibility. But I find external tables very interesting in the sense that there's a lot of different file formats out there that people need to shape. And the more we can broadly support exactly what they have without making them do transformations, the better. So, you know, it's making sure we can figure out how to get the right set of options to support the broadest set of data that there is.
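Editor's sketch: the idea of reading a customer's file as it is, by describing its format with options instead of converting it first, can be shown with Python's standard `csv` module. This is not Db2 external table syntax. The helper `read_rows`, its `null_marker` option, and the sample data are all made up for illustration.

```python
import csv
import io


def read_rows(text, delimiter=",", quotechar='"', null_marker=""):
    """Parse delimited text using caller-supplied format options.

    Fields equal to null_marker become None; everything else stays a string.
    """
    reader = csv.reader(io.StringIO(text), delimiter=delimiter, quotechar=quotechar)
    return [
        [None if field == null_marker else field for field in row]
        for row in reader
    ]


# A pipe-delimited file with single-quoted strings and a literal NULL marker,
# read as it is rather than being rewritten into a "standard" CSV first.
sample = "1|'Smith, J'|NULL\n2|'Lee'|42\n"
rows = read_rows(sample, delimiter="|", quotechar="'", null_marker="NULL")
print(rows)
```

The broader the set of options a loader accepts (delimiters, quoting, null markers, encodings, date formats), the less often a customer has to transform a file before moving it.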
Al Martin: Is it — I know you'd be studied. In other words you do your research. Where do you - where does the data movement expert do his research?
Michael Springgay: A lot of it comes from customers. You know, I work with a lot of customers. So we see what their data is and understand exactly how they have laid out their data.
Al Martin: What Rajani what if I want to go do a crash course on data movement? Any place you have me — direct me outside of the locations you already have at IBM?
Rajani Maindiratta: Just some internal sites focused, you know, that are (unintelligible) the sales people...
Al Martin: And best practice?
Rajani Maindiratta: Yes, yes best practices. But there's — if you just Google there's lot of articles I — articles around data, you know, big data challenges...
Al Martin: Yes.
Rajani Maindiratta: ...and then the big movement space.
Al Martin: How about books?
Michael Springgay: So I — when I do read it's usually for my own enjoyment so I — but I'm usually reading books that are like...
Al Martin: That's all right.
Michael Springgay: ...college many years earlier so I like — I want old detective books.
Al Martin: Old detective, are you...
Michael Springgay: I'm reading the Bosch series right now.
Al Martin: The Bosch series?
Michael Springgay: Michael Connelly.
Al Martin: Oh let's see, all right what about you? Anything...
Rajani Maindiratta: Yes, a couple of books actually. I'm reading Gandhi's biography. I read pieces of it before but never got through, so I started reading that. And then there's a coaching and leadership...
Al Martin: Gandhi's biography.
Rajani Maindiratta: Autobiography.
Al Martin: Autobiography.
Rajani Maindiratta: Yes, yeah.
Al Martin: What have you learned out of that?
Rajani Maindiratta: Oh just kind of perseverance.
Al Martin: Would you recommend it?
Rajani Maindiratta: Yes, definitely.
Al Martin: Yes, I recommend.
Rajani Maindiratta: Yes. And there's a coaching famous leadership book that I forgot about.
Al Martin: Fantastic. I like books. I got a list. I can — you know, my list is never exhausted but I keep going at it.
Al Martin: All right, fair enough, guys. I thank you for taking your time today. I appreciate it. Talk to you next time.