Providing transactional data to your Hadoop and Kafka data lake

Offering Manager, IBM

The data lake may be all about Apache Hadoop, but integrating operational data can be a challenge. A Hadoop software platform provides a proven, cost-effective, highly scalable and reliable means of storing vast data sets on commodity hardware. By its nature, however, it does not deal well with changing data: it has no concept of "update" or "delete." The power of discovery that comes with a schema is also missing, which creates a barrier to integrating well-understood transaction data that is more comfortably stored in a relational database. So, how can you get the most from your data, including structured business information and the very high-volume, unstructured data from social media, internet activity, sensors and more?

Making constantly changing data available

Contrast the limitations of Apache Hadoop with Apache Kafka. Designed from the outset for constantly changing events and data, Kafka is rapidly becoming the enterprise standard for information hubs that can be used alongside the data lake or to feed data to it. Using commodity hardware for highly scalable and reliable storage, Apache Kafka goes beyond Hadoop with a schema registry, self-compacting storage that understands the concept of a "key," and other characteristics that assume data will change. Dozens of "writers" and "consumers" build to this open standard, empowering integration of transaction and other rapidly changing information with enterprise data stores, processing platforms and more.
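
The key-aware storage described above corresponds to what Kafka calls log compaction: for each key, older records can be discarded once a newer one exists. The following is a minimal in-memory sketch of that idea, not Kafka's implementation. The `compact` function and the record layout are hypothetical, and treating a `None` value as a delete marker that is dropped immediately is a simplifying assumption.

```python
from typing import List, Optional, Tuple

# Hypothetical record shape: (key, value). A value of None marks a delete.
Record = Tuple[str, Optional[str]]


def compact(log: List[Record]) -> List[Record]:
    """Keep only the latest record per key, preserving log order."""
    # Position of the most recent record for every key.
    last_index = {key: i for i, (key, _) in enumerate(log)}
    return [
        (key, value)
        for i, (key, value) in enumerate(log)
        # Keep a record only if it is the newest for its key and not a delete.
        if last_index[key] == i and value is not None
    ]


log = [
    ("cust-1", "address A"),
    ("cust-2", "address B"),
    ("cust-1", "address C"),  # update: supersedes the first cust-1 record
    ("cust-2", None),         # delete: removes cust-2 altogether
    ("cust-3", "address D"),
]
compacted = compact(log)
```

Here `compacted` holds one record for `cust-1` (the updated address) and one for `cust-3`, which is how an append-only log can still represent updates and deletes.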

With data amassed in a Kafka-based information hub, Kafka consumers can then feed data to the desired end points, including:

  • Information server
  • Hadoop clusters
  • Cloud-based data stores

Alternatively, Kafka-consuming applications can perform analytics on the data amassed in the Kafka cluster itself, or use it to trigger real-time events. For example, a Kafka consumer application that subscribes to the relevant Kafka "topics" could send a "welcome back" email to a customer when replicated data indicates that the long-dormant customer has just accessed their account.
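
The "welcome back" scenario can be sketched as a small piece of consumer-side logic. This example is illustrative only: it uses a plain Python list in place of a real Kafka consumer, and the event fields (`customer_id`, `accessed_at`), the function name and the 180-day dormancy threshold are all assumptions. It also assumes events arrive in time order.

```python
from datetime import datetime, timedelta
from typing import Dict, Iterable, Iterator


def returning_customers(
    events: Iterable[dict],
    dormancy: timedelta = timedelta(days=180),
) -> Iterator[str]:
    """Yield the ID of each customer who accesses their account after a long gap."""
    last_seen: Dict[str, datetime] = {}
    for event in events:
        customer = event["customer_id"]
        accessed_at = event["accessed_at"]
        previous = last_seen.get(customer)
        # A gap at least as long as the dormancy threshold triggers the welcome.
        if previous is not None and accessed_at - previous >= dormancy:
            yield customer
        last_seen[customer] = accessed_at


# Stand-in for records read from a topic of account-access events.
events = [
    {"customer_id": "c1", "accessed_at": datetime(2023, 1, 1)},
    {"customer_id": "c2", "accessed_at": datetime(2023, 6, 1)},
    {"customer_id": "c1", "accessed_at": datetime(2023, 9, 1)},   # long gap
    {"customer_id": "c2", "accessed_at": datetime(2023, 6, 15)},  # short gap
]
to_welcome = list(returning_customers(events))
```

In a real deployment the loop would read from a Kafka consumer subscribed to the relevant topics, and each yielded ID would be handed to an email service rather than collected in a list.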

Making more real-time data available to such enterprise data lakes or data hubs is an ongoing challenge. For better decision-making and to reduce costs, enterprises need to capture information from source transactional systems with minimal impact, deliver changes to analytics and other systems at low latency, and analyze massive amounts of data in motion.

Delivering real-time transactional data replication

To help organizations deliver transactional data into Hadoop-based data lakes or Kafka-based information hubs, IBM Data Replication provides a Kafka target engine that streams data into Kafka using either a Java API-based writer with built-in buffering or a REST (Representational State Transfer) API with batch message posts. Alternatively, IBM Data Replication can deliver real-time feeds of transactional data from mainframes and distributed environments directly into Hadoop clusters through its Hadoop target engine, which uses a WebHDFS interface.
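
To make the idea of buffering and batch message posts concrete, here is a generic sketch of a writer that collects change records and posts them one batch at a time. It is not IBM Data Replication code and does not reproduce the product's wire format: the `BatchingWriter` class, the `post` callable and the `{"records": [...]}` payload layout are all hypothetical.

```python
import json
from typing import Any, Callable, Dict, List


class BatchingWriter:
    """Buffers change records and hands them to `post` one batch at a time."""

    def __init__(self, post: Callable[[str], None], max_batch: int = 100) -> None:
        if max_batch < 1:
            raise ValueError("max_batch must be at least 1")
        self._post = post  # hypothetical transport, for example an HTTP POST
        self._max_batch = max_batch
        self._buffer: List[Dict[str, Any]] = []

    def write(self, key: str, value: Dict[str, Any]) -> None:
        """Queue one record and flush automatically once the batch is full."""
        self._buffer.append({"key": key, "value": value})
        if len(self._buffer) >= self._max_batch:
            self.flush()

    def flush(self) -> None:
        """Post all buffered records as one JSON document, then clear the buffer."""
        if not self._buffer:
            return
        self._post(json.dumps({"records": self._buffer}))
        self._buffer = []


sent: List[str] = []  # stands in for the remote endpoint
writer = BatchingWriter(sent.append, max_batch=2)
writer.write("cust-1", {"op": "insert"})
writer.write("cust-2", {"op": "update"})  # batch is full, so it is posted
writer.write("cust-3", {"op": "delete"})
writer.flush()  # post the remaining partial batch
```

Batching trades a little latency for fewer round trips. A production writer would typically also flush on a timer so that a partly filled batch does not wait indefinitely.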

Learn more by reading the IBM Data Replication solutions brief on how transactional data can feed your Hadoop-based data lakes or Kafka-based data hubs.

If you’re ready to explore real-time data replication, reach out to your IBM sales representative or business partner; they’d be happy to talk with you about the benefits of the IBM Data Replication portfolio.