Reaching Near–Real-Time Data Replication: Part 1

Replication swiftly synchronizes geographically dispersed systems, isolates workloads, and helps avoid downtime

Today’s interconnected systems and mobile computing advances compel many organizations—including financial institutions, insurance companies, manufacturers, and cloud-based services providers—to move data between different database management systems (DBMSs), locations, and geographies. These organizations give top priority to continuously serving their clients with current data in a timely fashion, without any downtime or availability limitations.

Replication technology is one important component that organizations can deploy to handle the demand for moving and synchronizing data. In simple terms, replication is a method to synchronize data between two or more databases. Compared to other data movement techniques—such as extract, transform, and load (ETL) or DBMS utility operations—data replication focuses on changed data that is usually extracted from the source database system with very efficient log-based access mechanisms.

Replication techniques cover a wide range of solutions, methods, and topologies. Advanced replication solutions for relational database management systems (RDBMSs) read the database recovery logs to track insert, update, and delete operations—data manipulation language (DML). They also track common structural changes such as adding new columns, changing column data types, or adding completely new tables—data definition language (DDL). The detected changes are transferred from the source to target database systems.
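As a rough illustration of the apply side of this flow, the sketch below replays a list of captured DML changes against a target. It is not taken from any particular replication product: the `ChangeEvent` type, its field names, and the use of an in-memory dictionary as the "target table" are all assumptions made for the example, and the log-reading step itself is omitted.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class ChangeEvent:
    """Hypothetical change record, as a log reader might emit it."""
    op: str                               # "insert", "update", or "delete"
    key: int                              # primary key of the affected row
    row: Optional[Dict[str, Any]] = None  # column values (None for deletes)


def apply_changes(target: Dict[int, Dict[str, Any]],
                  events: List[ChangeEvent]) -> Dict[int, Dict[str, Any]]:
    """Replay change events, in log order, against an in-memory target."""
    for event in events:
        if event.op == "insert":
            target[event.key] = dict(event.row or {})
        elif event.op == "update":
            # Only the changed columns are carried in the event.
            target.setdefault(event.key, {}).update(event.row or {})
        elif event.op == "delete":
            target.pop(event.key, None)
        else:
            raise ValueError(f"unknown operation: {event.op}")
    return target


if __name__ == "__main__":
    replica: Dict[int, Dict[str, Any]] = {}
    apply_changes(replica, [
        ChangeEvent("insert", 1, {"name": "Alice", "city": "Berlin"}),
        ChangeEvent("update", 1, {"city": "Munich"}),
        ChangeEvent("insert", 2, {"name": "Bob", "city": "Paris"}),
        ChangeEvent("delete", 2),
    ])
    print(replica)
```

Only the changed rows travel from source to target here, which is the property that distinguishes this approach from a full ETL reload.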

Motivation for replication

The reasons for deploying data replication vary among organizations. Use cases range from simple one-to-one copies (data offloading for queries and reporting, operational data store (ODS) data feeds, workload isolation, and so on), through lightweight ETL that includes data transformations, auditing, and historization, to complex multidirectional topologies. Replication also helps organizations migrate database systems from one version to another, to new hardware, and even from one DBMS to another without incurring any downtime.

The key advantage of data replication over ETL tools or utility operations is the low latency and minimal impact that can be achieved when only change events are synchronized between systems. Advanced data replication solutions monitor replication latency at a granularity of seconds or milliseconds. They help eliminate batch windows and add minimal processor overhead.

Because replication techniques enable near–real-time delivery, they are increasingly relevant for geographically dispersed systems with two or more nodes, even allowing for active-active computing under certain circumstances. Low replication latency reduces the likelihood of update conflicts because a change usually reaches the other nodes before a competing change to the same data is made there. If a conflict does occur, embedded rules can help resolve it. Compared to ETL tools, replication generally has two disadvantages: transformation capabilities are limited (usually SQL-based only), and only RDBMSs are supported as a replication source.
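The article does not say which conflict resolution rules a given product embeds. As one common possibility, the sketch below shows a "last writer wins" rule for an active-active pair. The `Version` type, its fields, and the tie-break on site name are assumptions for illustration only, and the rule presumes that timestamps are comparable across sites.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    """Hypothetical row version as seen by one site."""
    value: str
    updated_at: float  # seconds since epoch; assumes comparable clocks
    site: str          # name of the site that made the change


def resolve(local: Version, incoming: Version) -> Version:
    """Last writer wins; ties are broken by site name so that
    every site picks the same winner independently."""
    if (incoming.updated_at, incoming.site) > (local.updated_at, local.site):
        return incoming
    return local


if __name__ == "__main__":
    a = Version("limit=500", updated_at=100.0, site="frankfurt")
    b = Version("limit=750", updated_at=100.2, site="new_york")
    # Both sites evaluate the same rule and converge on the same row.
    print(resolve(a, b).value)
    print(resolve(b, a).value)
```

The important property is that the rule is deterministic and symmetric: whichever site applies it, the same version survives, so the replicas converge.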

Replication requirements

Different replication solutions have different requirements. For example, trade or retail organizations may need data delivered within milliseconds, or the results of an analytical report may have to be replicated back to the operational systems in real time. Other organizations may need a stable replication target and want to synchronize on an hourly or daily basis. Some high-availability scenarios need sub-second latency to meet the recovery point objective (RPO), while others require deliberately delayed replication so that application or administrator errors can be detected and corrected before they are synchronized.

Continuous operation of replication solutions without any downtime is another common requirement—for example, when administrators add or remove tables or columns. Advanced replication solutions should be able to replicate data not only to databases, but also to other applications, web services, files, ETL processes, or big data platforms.

Database models often contain complex business logic based on referential integrity, stored procedures, triggers, and so on. A replication solution should manage this complexity and replicate data in the correct order without violating any business rules. Replication techniques typically have to bridge heterogeneous database systems and operating systems.
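One way to picture the ordering problem is to treat foreign-key relationships as a dependency graph and apply inserts parent-first. The sketch below uses the `graphlib` module from the Python standard library (Python 3.9 or later). The table names and the `insert_order` helper are hypothetical, and real products typically also rely on the commit order recorded in the log, which this example does not model.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def insert_order(foreign_keys: Dict[str, Set[str]]) -> List[str]:
    """Return tables ordered so that every parent precedes its children.

    foreign_keys maps a child table to the set of parent tables
    it references (hypothetical schema metadata).
    """
    return list(TopologicalSorter(foreign_keys).static_order())


if __name__ == "__main__":
    schema = {
        "order_items": {"orders", "products"},
        "orders": {"customers"},
    }
    # Parents such as "customers" come before the tables that reference them.
    print(insert_order(schema))
```

Deletes would need the reverse order, children first, and a cyclic set of constraints makes `TopologicalSorter` raise `graphlib.CycleError`, which is a useful signal that the constraints cannot be satisfied by ordering alone.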

Part 2 of this series describes in detail the most common replication use cases. Please share any thoughts or questions in the comments.
