The art of designing data flow on a free-form canvas

Program Director, Engineering, Data Connect, IBM

Do you feel like there is data everywhere, but none of it is really usable? Well, look no further. IBM Bluemix Data Connect has announced a brand-new Design Data Flow beta capability to help you easily consume large volumes of data coming from disparate data sources in the cloud or on premises. The data can be curated in a self-service fashion through a series of operations and written to a target that can also be in the cloud or on premises. You can then derive trusted insights from the prepared data using your preferred analytics tools.

Now, you might be thinking that any extract, transform and load (ETL) tool can consume data, curate it and help you derive insights. But imagine if each step in the flow ran automatically as soon as you added it, and you could see the state of your data instantly at any point in the flow. Further, imagine a built-in, self-service data preparation capability that gives you an innovative way to work with a data set to cleanse it and improve its quality.

Consider a simple scenario that demonstrates how IBM Bluemix Data Connect works. Say you have a customer that keeps customer information in a customer data set in an on-premises IBM DB2 database, and salary, contact and geo information in a prospect data set in cloud-based dashDB. Also assume the customer keeps sentiment information (Twitter data) in a sentiment data set in an object store. The goal is to find those customers who are good prospects for future promotions and sales.

The Design Data Flow beta capability introduces a free-form canvas that lets you easily design data flows that leverage the power of Apache Spark in the IBM Bluemix cloud:

  1. Create a connection to each of the three data sets. IBM Bluemix Data Connect supports a variety of connectors; chances are whatever your data source is, you’ll find a connector for it.
  2. Launch the Design Data Flow beta feature to add the three data sets to the canvas using a simple asset browser.
  3. Join the customer and prospect data sets to produce a single data set.
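To make the join in step 3 concrete, here is a minimal sketch in plain Python rather than in Data Connect itself. It assumes both data sets share a hypothetical customer_id key; the field names and sample rows are invented for illustration and are not taken from the product.

```python
# Hypothetical sample rows; field names are assumptions for illustration only.
customers = [
    {"customer_id": 1, "name": "Ana"},
    {"customer_id": 2, "name": "Ben"},
    {"customer_id": 3, "name": "Cy"},
]
prospects = [
    {"customer_id": 1, "salary": 85000, "geo": "US"},
    {"customer_id": 2, "salary": 40000, "geo": "DE"},
]


def inner_join(left, right, key):
    """Inner-join two lists of dict rows on a shared key."""
    # Index the right-hand rows by key so each left row is matched quickly.
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    joined = []
    for row in left:
        for match in index.get(row[key], []):
            # Merge the two rows into one combined record.
            joined.append({**row, **match})
    return joined


customer_prospects = inner_join(customers, prospects, "customer_id")
```

Only customers that also appear in the prospect data set survive an inner join, so the third customer is dropped from the single combined data set.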

The flow runs automatically, and you can preview the sample result right away. Because the sentiment data is coming from online sources, it probably needs a lot of cleansing. You can prepare the data by launching an integrated, self-service data preparation capability to standardize phone numbers and addresses, convert the data types, remove unwanted data and much more. You can then join this data with the single data set from the customer and prospect data sets. You can further filter the data by salary or geo, and then remove duplicates and sort the data.
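The sequence just described (cleanse, join, filter, remove duplicates, sort) can be sketched in plain Python. This is an illustration under stated assumptions, not how Data Connect or Spark implements it: the phone field as join key, the 10-digit phone rule, the score field and all sample values are hypothetical.

```python
import re


def standardize_phone(raw):
    """Keep digits only; accept exactly 10 digits (an assumed rule)."""
    digits = re.sub(r"\D", "", raw or "")
    return digits if len(digits) == 10 else None


def cleanse(sentiment_rows):
    """Standardize phones, convert scores to float, drop unusable rows."""
    cleaned = []
    for row in sentiment_rows:
        phone = standardize_phone(row.get("phone"))
        try:
            score = float(row.get("score"))
        except (TypeError, ValueError):
            continue  # remove rows whose score cannot be converted
        if phone is not None:
            cleaned.append({"phone": phone, "score": score})
    return cleaned


def build_prospect_list(customer_prospects, sentiment_rows, min_salary):
    # Index cleansed sentiment rows by standardized phone number.
    by_phone = {}
    for row in cleanse(sentiment_rows):
        by_phone.setdefault(row["phone"], []).append(row)
    # Join on the standardized phone number.
    joined = [
        {**cust, **sent}
        for cust in customer_prospects
        for sent in by_phone.get(standardize_phone(cust.get("phone")), [])
    ]
    # Filter by salary.
    filtered = [r for r in joined if r["salary"] >= min_salary]
    # Remove duplicates, keeping the first row seen per customer.
    seen, unique = set(), []
    for r in filtered:
        if r["customer_id"] not in seen:
            seen.add(r["customer_id"])
            unique.append(r)
    # Sort by sentiment score, highest first.
    return sorted(unique, key=lambda r: r["score"], reverse=True)


# Hypothetical sample data for illustration.
customer_prospects = [
    {"customer_id": 1, "name": "Ana", "salary": 85000, "phone": "(555) 010-2000"},
    {"customer_id": 2, "name": "Ben", "salary": 40000, "phone": "555-010-3000"},
    {"customer_id": 3, "name": "Cy", "salary": 90000, "phone": "555.010.4000"},
]
sentiment = [
    {"phone": "555 010 2000", "score": "0.4"},
    {"phone": "555-010-2000", "score": "0.9"},
    {"phone": "555-010-3000", "score": "0.8"},
    {"phone": "5550104000", "score": "0.7"},
    {"phone": "12345", "score": "0.5"},
    {"phone": "555-010-4000", "score": "n/a"},
]
good_prospects = build_prospect_list(customer_prospects, sentiment, 50000)
```

Each stage produces an intermediate list that could be inspected on its own, which loosely mirrors previewing the sample result at any point in the flow.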

Applying analysis, reporting and more

Again, you can validate sample data at any point in the flow to make sure it looks good. Then just write the result data set to a target. Your flow is saved as an activity, which can be run immediately or scheduled to run on a regular basis to analyze new data as it becomes available. You can then use analytical tools to build reports and send them to your sales force to act on.

Read more about IBM Bluemix Data Connect (formerly IBM DataWorks). And be sure to register for IBM Insight at World of Watson 2016 and join us at Session 1653 to learn more about this new and exciting capability.

Discover data flow design at IBM Insight at World of Watson 2016