Making Data Simple: What is Data Gravity?


Moving data often impacts system performance, so how do you move large volumes of data safely and securely? Data movement becomes even more critical when you consider moving data from ground to cloud. Joe Bostian, IBM Z data science architect with IBM Analytics, and Mythili Venkatakrishnan, distinguished engineer and IBM Z Analytics technology lead with IBM Systems, join the Making Data Simple team to discuss how to move data, run analytics and maintain performance.

Show Notes

00.30 Connect with Al Martin on LinkedIn and Twitter

00.35 Connect with Kate Nichols on LinkedIn and Twitter

00.35 Connect with Fatima Sirhindi on LinkedIn and Twitter

00.40 Connect with Joe Bostian on LinkedIn and Twitter

00.45 Connect with Mythili Venkatakrishnan on LinkedIn and Twitter

01.00 Learn more about IBM Z.

02.30 Learn more about Software Engineer Dave McCrory here. 

03.00 Learn more about data gravity here.

06.00 Learn more about IBM IoT initiatives here. 

06.07 Learn more about Amazon Alexa here. 

26.00 Learn more about GitHub here. 

Ready to dig deeper? Check out our previous podcast episodes of Making Data Simple.


Al Martin: Welcome to Making Data Simple. This is the podcast all about data. Thanks again to my producers, Kate Nichols and Fatima Sirhindi; they do a fantastic job, and they don't get enough credit. Today we're gonna talk about data gravity. And I have two guests, Joe Bostian and Mythili Venkatakrishnan. Hope I got that right, I had to slow it down there. Joe is a Z system data science architect and Mythili is an IBM distinguished engineer on Z. Why don’t I turn it over to you guys to introduce yourselves. It's interesting that we're gonna talk about data gravity and you guys are both on the Z side. I've got to believe there's a connection there and an interest, but Joe, why don’t I give the mic to you first.

Joe Bostian: My name is Joe Bostian. I've worked on analytics on system Z for the last two and a half years or so. This is a project that Mythili was really sort of the driving force behind in the early days, and so I joined the team after, you know, we got it up and running. I've been with it for the last couple years looking at architectural issues on the platform and architectural issues in the analytics environment in general. There are a lot of truths about data science architectures that are platform agnostic and, you know, hopefully, we will cover a lot of this here today.

Mythili Venkatakrishnan: Great. Mythili Venkatakrishnan. I also work in IBM's Z brand organization and I lead architecture for analytics on the platform. Now, you're right, there's a lot of connectivity between data gravity, which is focused around where the majority of the data originates from for analytics, and Z, because many of our clients have a significant amount of data that is contributing to their core business assets that are actually on Z. So that's how the two are connected.

Al Martin: Thank you, and again I hope I didn’t butcher your name too bad, my apologies. Let's jump right into data gravity and I have to say that the term itself, when I first heard it several years ago, I rolled my eyes at it. It was kind of cute, if you will. It's another term that the industry tries to push at us.

Here is what I know: I know it was coined by a software engineer, Dave McCrory, and it's really about data that obeys rules like mass does, like physical mass does. I guess what that means to me, as a business owner, is that data essentially attracts other data, whether it’s applications or business logic, and the larger that mass, the more data or applications are moved to that mass. So, my approach and design of products and hybrid data management that I'm responsible for is always the concept of moving analytics to the data. Hence, I guess, under an umbrella of data gravity. So, that's kind of my high level definition. I'd like to get your definition and really correct everything I just said that was wrong.

Joe Bostian: Well, I think we generally agree that data gravity attracts applications and services to the data -- bring your compute to the data is another way of saying it -- so that you take and process data more or less where it resides. There's also been a history where data is often collected into a central repository, and that can scale up to a certain point. But what we're trying to say with data gravity, in the original idea posited by Dave McCrory, is that it's applications and services that move close to the data. And data aggregation is sort of a separate topic that is informed by the attraction of these applications and services, but it doesn't necessarily mean that data always attracts more data per se.

Mythili Venkatakrishnan: And I agree with that. I think the other part is you mentioned moving analytics to the data and to keep our updated gravity, and I think that is a very useful definition as well. One of the challenges that clients have is understanding where, if you will, most of the data is. And it really depends on the use case. So, we apply data gravity when we're talking with clients about, you know, where do I move my analytics applications to when I want to use data that is in multiple environments? You can use concepts of data gravity -- either in terms of the rate of change in the data, the volume of data, the presence of other transactional environments that use that data -- to help define where you're going to move your analytics to. Where is the gravity for that use case?

Al Martin: Another interesting way of looking at it is that as the data is manipulated by software, it tends to generate more data, which is a form of context. Particularly when data is in a public space, it gathers more and more context as it interacts with other data. I saw this referred to in something I was reading; it was called “gathering moss.” I don't know why -- maybe it's another term, another cute term. But all of these observations, in my mind, and you guys correct me if I'm wrong here, affect the tendency of data to be centralized for cost, efficiency, proximity to the source of the context, and the ease with which it can be moved around or repurposed. Do you see it the same way?

Joe Bostian: Yeah, I think the data tends to aggregate and attract other data up to a certain point, where it becomes sort of difficult to manage when it gets to a certain scale. I think a lot of times what makes sense and works really, really well within a certain scope has its limitations, and it gets to the point where you say, wait a minute now, aggregating data any further than we already have is gonna get us to a point where we spend a lot more resources or we have to start changing the way we do business, because simply the infrastructure we've got won't handle the additional load.

In a way, if you think about this, this is a cross-platform concern. Let's take for instance a lot of the IoT types of devices that are out there, like Alexa and smart home speakers and things like that. If they sent raw data verbatim back to be analyzed, and then did analysis so that you could come to some kind of an event that you wanted to occur within the user's home, that wouldn’t be a very good way of doing things, right? Clearly that device needs to analyze the data to a certain level there on premises, within the home, to use a technical term, right? So, the idea of bringing analytics and applications to where the data originates is sort of a cross-platform concept.

Al Martin: I have an IoT question here in a second. I want to turn it over to you, Mythili, but let me say this. To me data is just data. What matters is what you do with it. You know, we're in information technology, not data technology, as much as I love data. But enter the need for data gravity. What would you say is its importance? Can you better characterize the importance for those that are listening?

Mythili Venkatakrishnan: The importance of data gravity? 

Al Martin: Yeah data gravity, why does it matter? Why are we talking about it? Why is it a term?      

Mythili Venkatakrishnan: Right. So, it's become increasingly relevant, I think, as the volume of data that feeds analytics has grown for many of our clients. Across every industry the variety of the data has grown. And so, increasingly, our clients, from what I've noticed, are using a more federated approach to access a variety of data and then locating the analytic application where there is data gravity for that use case. So, for example, if you want to blend in information that you own that runs in your transactional systems, around credit card information, client information, vendor information, etcetera, merchant info, and then you want to also include in there aspects around what your clients are tweeting about or what’s out on social media, you can do that. But where will you locate your application? Because there's a variety of data and, as Joe mentioned, moving it all to centralized locations is becoming untenable for these cases, because of the need for real-time information and because of the need to really handle huge amounts of data. So, the question around data gravity is really important when you look at how you are going to deploy this use case, where you will locate your analytics, where is closest to the data and what that actually means. So, I think that's why it's becoming increasingly important to understand.

Al Martin: Is it because of the real-time analytics you mentioned, or is it because, you know, data transmission is a form of friction? Meaning if you... The more you have to move data, which I want to talk about as well, you know, then you've got throughputs, bandwidth performance issues, or is it both? 

Mythili Venkatakrishnan: It's really both, but the need for real time is something that we've noticed. It's definitely there. And real time means not just how fast you can run an algorithm, but it also means how current is the data you're using to generate insights? And can you generate those insights at the point of impact and make a business decision with that? And so if you rely on data movement as the only source on which to run analytics, then you have got that increased friction, you're going to have those delays, you're gonna have increased costs; but, in addition, it's very likely that you're going to be using stale data.

Al Martin: What kind of projects are you guys working on as it relates to data gravity? 

Joe Bostian: Well, we are both working on the same offering, open data analytics for Z, running on z/OS. In addressing analytics on z/OS, one of the other areas we can explore is data governance, and also the fact that a lot of times on your enterprise systems you will have your enterprise data stored with personally identifiable information, and you will need to either mask or hold back that data from your application developers and from your data scientists, who need access to a database but probably don't need access to things like social security numbers, credit card numbers or cell phone numbers. Our concerns and our requirements are built around data gravity for analytics on system Z. One of the advantages that we can bring is that by handling the data on platform, with all the governance rules that are already enforced for the enterprise, you can continue to keep a lot of those policies and rules enforced, analyze your data, and then the results can be scrubbed of sensitive information that perhaps you don't want to aggregate in another location.
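To make the masking idea Joe describes a little more concrete, here is a minimal Python sketch of scrubbing results before they leave the platform where the data lives. It is only an illustration: the function names, the two regular expressions and the placeholder strings are assumptions made for this example, not part of any IBM offering, and real governance rules would be considerably stricter.

```python
import re

# Hypothetical patterns, for illustration only: a US-style social security
# number and a 16-digit card number with optional space or hyphen separators.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD_PATTERN = re.compile(r"\b(?:\d{4}[- ]?){3}\d{4}\b")


def mask_sensitive(text: str) -> str:
    """Replace SSN-like and card-like numbers with fixed placeholders."""
    text = SSN_PATTERN.sub("***-**-****", text)
    return CARD_PATTERN.sub("****-****-****-****", text)


def scrub_record(record: dict) -> dict:
    """Return a copy of the record with every string field masked."""
    return {
        key: mask_sensitive(value) if isinstance(value, str) else value
        for key, value in record.items()
    }


if __name__ == "__main__":
    result = {"note": "paid with 4111 1111 1111 1111", "amount": 42.5}
    print(scrub_record(result))
```

The point of the sketch is the ordering: analysis runs next to the governed data, and only the scrubbed result is passed along to other systems.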

Al Martin: So, essentially what you're working on is some privacy, almost fraud detection, and using data gravity, or the concept, I should say, of data gravity, as a way of keeping data in the Z platform, or wherever, and managing it there. Is that what you're saying?

Joe Bostian: Yeah, we also have to recognize too that the data movement usually means data copying, right? So, what do you do when you move your data from one source to another? Most often you don't delete it from your original source. Now you have two copies and if you've got a lot of enterprise data and multiple copies, even if you're not concerned about PII or other types of sensitive information, now you've got two copies. Which one is going to be your reference copy? You have to manage those kinds of things. So, we are doing all of that kind of enterprise data management on the platform. 

Al Martin: The reason the movement is very interesting to me is that it kind of goes back to a quote -- and I am going to butcher this quote -- the quote that I had from Microsoft researcher Jim Gray where he said, “compared to the cost of moving bytes around, everything else is free.” Meaning you don't want to move data if you don't have to move data. So, any more, you know, Mythili, do you have any comments as to the data movement and its impacts in this and why that lends itself to the concept of data gravity? 

Mythili Venkatakrishnan: Yeah, I mean I think the data movement in and of itself is certainly costly. Some of our clients have shown that they spend thirty percent of the capacity they have, from a processor perspective, just moving data. Moving the data hasn’t gotten any business insights. It is sort of the first step that they see, but it hasn’t delivered you anything and you're spending thirty percent of your cycles just doing that. And so the concept really is, rather than spending those cycles and processing power moving the data, to apply it to doing the first round of analytics where the data is produced.

So doing that at the source and centering it around this notion of data gravity can give you both business benefits, certainly, as well as cost benefits. And I think across the board, in what we've been working on, why we would consider this for Z really centers around the question of what data clients have on Z. And there is a significant amount of data, so we are very much focused on materializing these concepts around data gravity there.

Al Martin: So, Joe, you mentioned IoT and that was one of the questions I had. Maybe you already have answered this, but if I look at, by example, a jet engine, there's gigabytes of data that's being taken and sending that back to a central location. Often it can be difficult at best. Or, when you're talking about an autonomous car, you've got to make an instant decision. Or a train, you've got to make an instant decision. You want to make sure you can do that. It goes back to, Mythili, to your comments around, you know, you gotta make it real-time analytics at the device. Any more to that? Any more concepts that the users should think about as it relates to IoT?

Joe Bostian: You know, the choice to reduce the data and do the initial level of analysis on device is just sort of a natural one. If anybody's going to write, you know, any kind of an application that runs on a device, an IoT-styled device, unless it is a really primitive device, or basically just a sensor, you're going to do some kind of data reduction or manipulation at the source, rather than, you know, just pipe data over to a central server. If you don't do that, then all you have is sort of a dumb sensor, right?

So, in the real world of IoT I think our devices are only gonna get smarter over time, right? Certainly, a phone is a perfect example. They're gonna continue to aggregate function and process their data on site, and really that same concept is analogous to what we do on the mainframe as well. I think one other thing to remember too is the nature of the way the data is aggregated. If you have a device, an IoT device, that's streaming a stream of data, your logistics around aggregating that are different than if you go ETL a large batch from another machine, right? So, they present different challenges, but in the end the data gravity concepts are the same. They might appear to be two different problems; but, really, they're just different facets of the same issue.
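The on-device data reduction Joe describes can be sketched in a few lines of Python. This is a hedged illustration rather than anything discussed on the show: the function name, the summary fields and the threshold parameter are all assumptions chosen for the example. The idea is that the device keeps the raw readings local and forwards only a small summary.

```python
from statistics import mean


def summarize_window(readings, threshold):
    """Reduce a window of raw sensor readings to a small summary.

    Only this summary would be sent to a central system; the raw
    readings stay on the device. `threshold` is a hypothetical alert
    level chosen by whoever deploys the device.
    """
    if not readings:
        raise ValueError("readings must not be empty")
    peak = max(readings)
    return {
        "count": len(readings),
        "mean": mean(readings),
        "max": peak,
        "alert": peak > threshold,
    }


if __name__ == "__main__":
    window = [70.0, 72.0, 71.0, 95.0]
    print(summarize_window(window, threshold=90.0))
```

Four readings become one small record here; with a window of thousands of readings the summary stays the same size, which is where the savings in data movement come from.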

Al Martin: Mythili you got anything else to add or…

Mythili Venkatakrishnan: I would say that we're seeing an increasing interaction between some of the structured data that we would typically have on core transactional systems and IoT information; but, as Joe pointed out, consuming those raw pieces of input is really untenable. So, as those devices perform that data analytics and the aggregation, they can actually notify some of the more structured transactional environments and the analytics that are there to cause additional information and insights to be generated. So the interaction across the structured data that largely resides on capabilities like the IBM Z, combined with information and insights that are generated by aggregating that IoT information, I think will be a very meaningful path in the future.

Al Martin: So I think, look, we've talked about IoT, we’ve talked about data movement, we've talked about the need for privacy, even maybe some legislation in there, that kind of stuff. What have we missed? Are there any specific examples or use cases? You talked to one, Joe, in terms of fraud, I believe, you were talking about earlier. Any other use cases that you're either working on, or are worth mentioning?

Joe Bostian: Well, I might defer to Mythili on this work. I'm more the technologist and she understands the business climate better than I do. I think she has a better handle on valuable business cases. 

Mythili Venkatakrishnan: Sure, so in the financial space, I think we discussed fraud or fraud detection, specifically in the area of enhanced credit card fraud detection using advanced analytics. Many of our, you know, large financial institutions of course have advanced detection for fraud. Many of them are rules based, so incorporating advanced analytics or real-time analytics as part of that flow, to determine when a transaction might be fraudulent, I think is something that we are seeing actively when working with clients in that area.

In the insurance industry, take healthcare, for example: being able to look at each claim as it comes through and, rather than have the claim go through the automated claims adjudication systems using just rules-based technology, incorporate predictive technology so that you can determine the likelihood that the claim will be appealed if it’s denied and the likelihood that it is going to be appealed successfully, and enable the insurer to make different business decisions.

And, finally, in the retail space there's a lot of interest in looking at real-time analytics for logistics and identifying when there are supply issues with respect to the supply chain. When there are issues regarding not just supply, but issues in some of the external data that might need to be combined in order to get business insights. So, there are many use cases and they really span across industries. 

Al Martin: Very good. In my business, look, I'm working to bring compute to the data and avoid movement. Hence the reason we're not called just data management; it's hybrid data management, because we need analytic solutions that provide a hybrid solution that connects to the data where it lives. In the concept of data gravity, how do you guys apply that to your day to day, though? It's kind of like a principle, I don't know if it's a theory, or, I mean, how do you really take data gravity and put it into action, versus just a cute way to describe putting analytics where your data is?

Joe Bostian: Well, when we work with some of our customers, one of the first things we do when we go through a proof of concept is to work with them to identify what kinds of data are already available on the platform, identify what they want to do with it, and then we can work with them to shape the results that we want to present. How do they wanna present it? Who is their audience? And, you know, how do we get from raw data to the finished products, so to speak. So, once we can identify those kinds of parameters for a particular environment, then we can work with the customer to understand the larger architecture of their environment, because they're never, you know, based on one system; there's always a network of systems throughout the enterprise. We can help them decide then what we're gonna do with these results.

Do you want to present these to make a business decision, you know, to a business leader within the company and that's the end of it? Do those results then get archived and integrated with other data lakes that are off platform? Maybe, perhaps, you wanna do historical analysis or trend analysis, something like that, and we work with them to help to try to make those kinds of architectural decisions about how they get from raw data to a sort of final resting place of the results and archiving. So, we try to help them with the entire work from beginning, all the way through the analytics cycle to archiving their data results. 

Al Martin: What do you say, I mean, we've already talked about all of the different challenges with the business, but what are the challenges that you're running into when you're working with these clients? Is it the path to cloud, trying to get there, and data movement being maybe a problem? Is it trying to make a truly hybrid business scenario, so that they can, you know, keep their central repositories where they have them and still grow within the cloud? I mean, what are the biggest challenges that you're facing right now?

Joe Bostian: Well, from my point of view, just helping them understand the technological environment they’re working in is the initial challenge. And that's generally where, you know, I work with customers up to a point where, you know, we say what are all the technologies available? What's the best choice for you? What fits the skill set for your application developers and perhaps your data scientists? Mythili works with the customer more at the big picture level, the business level, but from my point of view I'm trying to help customers at the technological level, so that they can complete a step in their work flow. I don’t know, Mythili, if you wanted to talk about the business decision part of it.

Mythili Venkatakrishnan: Yeah, I mean I think we do have challenges. Just in terms of organizational challenges, sometimes when you're talking with the enterprise architecture team, they might really understand the concept of data gravity and find use cases that fit that very well. Then as that transitions into needing to include the data scientists from the client or finding that skill in an organization, sometimes those kinds of challenges can really present themselves significantly in a project. Other times it is the technology in pulling all that together, as Joe talked about, but the initial phase in terms of getting these projects started with the client can often be more organizational. 

Al Martin: Any other research that you would recommend or locations that our audience can go to learn more about data gravity? 

Joe Bostian: There are some original blog posts by Dave McCrory in 2010. We also have a lot of Z-based analytics information that we can make available to people, that sort of embodies the whole concept of data gravity. Pretty much everything we do is based on architectures that are defined by data gravity.

And if anybody wants, you know, more details about, you know, what we're thinking about this, they can always contact myself or Mythili. 

Mythili Venkatakrishnan: And there is also a Think blog by Forrester. Michael Cherry from Forrester has written a couple of papers around data gravity as well that can shed some light on that. 

Al Martin: As I am talking with two Z gurus, the one thing I think you guys did a fantastic job of is bringing analytics to the data. This is data from Fortune 100 and Fortune 500 clients; they have a ton of critical data residing in Z. So I know a lot of the data science experience that is embedded within Z. I could go on with a number of latest technologies that you have within the Z platforms, but is there anything that's worth mentioning on your side?

Joe Bostian: Well, I think a key takeaway you touched on is the data science experience. There are a couple of different offerings that we have on the platform depending on who it is, you know, that wants to exploit analytics capabilities on the platform. There is open data analytics for Z, which is more of an application developers interface. Then there's a higher level machine learning for Z that includes a large part of the data science experience and presents sort of the machine learning work flow interface to the user and system administrator. Just one other thing before I forget. One thing that we really try to emphasize to our customers and to anybody who's interested in analytics on the platform is that we're trying to present mainframe system Z as just another node in your overall analytics environment. It's certainly a very capable platform to do things on, but we're not coming forward and saying, you know, run it on Z or bring it over from another platform. 

Z fits very well in the ecosystem of enterprise data, and is, in a sense, a base for that, but it's also very good at fitting in equally well with any of the other analytics infrastructure that's already grown up in a lot of shops. So, when we talk to customers we don't go in and say re-purpose and re-host your data lake on Z. We say federate the data you already have on Z with all of your other analytics resources within your enterprise. And Z can fit in almost seamlessly with the rest of your environment.

Al Martin: Well stated. So, Mythili, Joe, anything that I didn't cover that we feel like we should cover before we break here?

Mythili Venkatakrishnan: I thought it was a pretty thorough discussion. 

Al Martin: All right, very good. Hey, so, as we talked before the podcast, I always finish with a little bit -- it's not too personal, I promise you that -- but a couple quick questions in terms of you know you personally. 

So let me ask this, you guys are two leaders in the organization, as I know it. What do you see as the most important habit for leaders to develop to become more effective, or the best advice in leadership that you ever got? Leave us with some wisdom.

Mythili Venkatakrishnan: I think some of the best advice I got was to stay persistent. And if you believe in what you're building and in what you're doing and in the value you're bringing to clients keep persistent and don't take no for an answer and try to find a solution. 

Joe Bostian: And I think, if anything, this particular project has taught me, it's to be open minded when it comes to decision making. If you look at a platform and there is a solution that strikes you as something that you're going to pursue, still be careful and make sure that you've considered all the possibilities, right? Because especially in the open source community and the environment we live in today, there are often a whole set of solutions to the same kind of problem. So, make sure you consider all of your different options for making a decision and moving on. 

Al Martin: Well, while this podcast is absolutely the best, are there any other podcasts, or any other collateral literature, you know, books that you've read as of late around technology that you would highly recommend?

Joe Bostian: There's so much information out there. I mean, as part of our environment, we work pretty closely with the Anaconda community, and Python in particular. And I spend all my time digging through GitHub repositories, Stack Overflow and all sorts of documentation repositories that are scattered all over the web. I'm finding that there's no one resource I go to. I probably reference half a dozen different ones on a daily basis. So, it's hard for me to recommend any one source. I would say if you stay close to Stack Overflow, to GitHub and any of your conventional documentation resources, from a technical standpoint you'll get all your information.

Al Martin: Mythili, do you have anything else to add? Any areas that you'd direct to the listeners? 

Mythili Venkatakrishnan: I would only add to that there is a lot of information and future-looking forecasts coming from the analyst community. Sometimes I draw on them to figure out where things are headed and where we need to be positioned from a Z perspective. Sometimes to get to all the technical information you can find around in the Open Source communities, I think there are a vast amount of publications from, you know, IDC or Gartner or Forrester, that help to articulate where each of their points of view lie in terms of where analytics is headed. That gives us some headlight into where we might be thinking about going in the future.

Al Martin: Fantastic, appreciate it. I said that the last question was the last question, but I've got one more, and that is, Joe, Mythili, where can the audience find you? We will put it in the show notes, but if they're looking for you either on Twitter or LinkedIn, I assume you are out there?

Joe Bostian: Yeah, they can find me. I'm on LinkedIn -- JaBostian. I've been trying to make our presence visible through our open data analytics webpage. If you go there, there are a lot of really good resources that we have there. We have a blog there as well. We are trying to make it a one-stop portal for all things analytics on z/OS. Yeah, there's plenty of information out there.

Mythili Venkatakrishnan:  And I am on LinkedIn as well, and from an IBM perspective, my first name will get you my email. You can contact me that way as well. 

Al Martin: Fantastic. Thank you guys for joining us today. I learned a lot. I appreciate it. And for the listeners out there, thanks for listening, talk to you next time.