CDC using Oracle XStream mode & Talend

Change data capture is an advanced technology for data replication and loading that reduces the time and resource costs of data warehousing programs and facilitates real-time data integration across the enterprise. By detecting changed records in data sources in real time and propagating those changes to an ETL data warehouse, change data capture can sharply reduce the warehouse's need for bulk-load updating.

Why do you need to capture and move the changes in your data?

• Populating centralized databases, data marts, data warehouses, or data lakes
• Enabling machine learning, advanced analytics and AI on modern data architectures like Hadoop and Spark
• Enabling queries, reports, business intelligence or analytics without production impact
• Feeding real-time data to employee, customer or partner applications
• Keeping data from siloed databases in sync
• Reducing the impact of database maintenance, backup or testing
• Re-platforming to new database or operating systems
• Consolidating databases

How does Talend CDC work?

Talend CDC is based on a publish/subscribe model in which the publisher captures data changes in real time and makes them available to the subscribers, which can be databases or applications.

The Oracle Database records changes in the transaction log in commit order by assigning a System Change Number (SCN) to every transaction.
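For reference, the current SCN of the source database can be checked directly. A minimal sketch, assuming SQL*Plus access with a user that can read V$DATABASE (the connect string below is a placeholder):

# Minimal sketch: read the current System Change Number (SCN) of the
# source database. Connection details are placeholders.
sqlplus -s system/password@ORCLPDB <<'EOF'
SELECT current_scn FROM v$database;
EOF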

Three different CDC modes are available in Talend Studio:

Trigger: the default mode used by the CDC components.

Redo/Archive log: used with Oracle 11g and earlier versions.

XStream: used only with Oracle 12c, through OCI.

Benefits of Log-Based Change Data Capture:

Two of the modes above are log-based: Redo/Archive log (Oracle 11g and earlier) and XStream (Oracle 12c with OCI).

The biggest benefit of log-based change data capture is its asynchronous nature: changes are captured independently of the source application performing the changes. In addition:

• The additional performance impact on the source system is low

• CDC enables the implementation of near real-time architectures

• No significant changes to the application in the source system are required

• CDC reduces the amount of data transmitted over the network

About Oracle XStream:

XStream consists of Oracle Database components and application programming interfaces (APIs) that enable client applications to receive data changes from an Oracle database and send data changes to an Oracle database.

These data changes can be shared between Oracle databases and other systems. The other systems include non-Oracle databases, non-RDBMS Oracle products, file systems, third-party software applications, and so on. A client application is designed by the user for specific purposes and use cases.

XStream consists of two major features: XStream Out and XStream In. 

XStream Out provides Oracle Database components and APIs that enable you to share data changes made to an Oracle database with other systems. XStream Out can retrieve both data manipulation language (DML) and data definition language (DDL) changes from the redo log and send these changes to a client application that uses the APIs.

XStream In provides Oracle Database components and APIs that enable you to share data changes made to other systems with an Oracle database. XStream In can apply these changes to database objects in the Oracle database.
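As an illustration of the XStream Out side, an outbound server can be created with the DBMS_XSTREAM_ADM package. This is only a rough sketch with placeholder names (the c##xstrmadmin user, the xout server, and the HR.EMPLOYEES table are assumptions); the database must also meet the usual prerequisites such as ARCHIVELOG mode, supplemental logging, and an XStream administrator granted privileges via DBMS_XSTREAM_AUTH.GRANT_ADMIN_PRIVILEGE:

# Rough sketch: create an XStream Out outbound server for one table.
# All names and the connect string are placeholders.
sqlplus -s c##xstrmadmin/password@ORCL <<'EOF'
BEGIN
  DBMS_XSTREAM_ADM.CREATE_OUTBOUND(
    server_name => 'xout',
    table_names => 'HR.EMPLOYEES');
END;
/
EOF

A client application (in this context, Talend's XStream-mode CDC components) then attaches to this outbound server to receive the change stream.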

Reference:

https://help.talend.com/r/6lpaT_M~MEVrltFQcrPe4g/Ni3tTLbdYnf7GCwNT5F5rQ

http://docs.oracle.com/cd/E11882_01/server.112/e16545/toc.htm

Moving data from Oracle to HDFS

I divided the scope into two stages:

Stage 1 | Initial data load from Oracle Database to HDFS (bulk load)

  • Here I used the Oracle native utility Copy to Hadoop to reduce friction

Stage 2 | Synchronize data from Oracle to Hadoop (delta changes)

  • Here I used SharePlex, a comparatively effective third-party replication tool from Quest (formerly Dell)

Stage 1 | Initial data load from Oracle Database to HDFS

Generally, there are two tools for moving data from Oracle Database to HDFS: Sqoop and Copy to Hadoop.

Copy to Hadoop has advantages over Sqoop here. In my case the source involves complex queries (functions, PL/SQL, joins) and even views (a sub-case of complex queries), so Copy to Hadoop is the recommended choice for the initial load. Since it is an Oracle native utility, it saves time and effort.

Copy to Hadoop creates Data Pump files that can contain any Oracle data type supported by Oracle external tables, and the data is stored in Oracle's own data types. When data types are converted from Oracle format to Java format (as Sqoop does), there is always a risk that some information is converted incorrectly; Copy to Hadoop protects us from that.

Below are the high-level steps to copy the data:

Note:
Create Oracle Data Pump files from the Oracle Database table data that we want to move. We can do this by exporting the data from the table using the ORACLE_DATAPUMP access driver. (Oracle Data Pump files generated by other database utilities are not accessible by Hive tables; the data must be exported with the ORACLE_DATAPUMP access driver.)

• Create the Oracle Data Pump files using the ORACLE_DATAPUMP access driver.

• Copy the Oracle Data Pump files from the source system to HDFS.

• Create Hive tables to read the Oracle Data Pump files.
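A minimal end-to-end sketch of these three steps, using placeholder schema, directory, and path names (the exp_dir directory object is assumed to exist). The Hive SerDe and InputFormat class names are the ones that ship with Copy to Hadoop, but they should be verified against your Big Data SQL installation:

# Step 1 (on the source database): export the table to a Data Pump file
# with the ORACLE_DATAPUMP access driver. Schema, directory object, and
# file names are placeholders.
sqlplus -s scott/tiger@ORCL <<'EOF'
CREATE TABLE emp_dp
  ORGANIZATION EXTERNAL (
    TYPE ORACLE_DATAPUMP
    DEFAULT DIRECTORY exp_dir
    LOCATION ('emp.dmp'))
  AS SELECT * FROM emp;
EOF

# Step 2: copy the generated Data Pump file from the database server to HDFS.
hdfs dfs -mkdir -p /user/oracle/emp_dp
hdfs dfs -put /u01/exp_dir/emp.dmp /user/oracle/emp_dp/

# Step 3: create a Hive external table over the Data Pump file.
# The SerDe/InputFormat classes below come with Copy to Hadoop; verify
# the exact class names for your release.
hive -e "
CREATE EXTERNAL TABLE emp_hive
  ROW FORMAT SERDE 'oracle.hadoop.hive.datapump.DPSerDe'
  STORED AS
    INPUTFORMAT 'oracle.hadoop.hive.datapump.DPInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
  LOCATION '/user/oracle/emp_dp';
"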

What Is Copy to Hadoop?
Oracle Big Data SQL includes the Oracle Copy to Hadoop utility. This utility makes it simple to identify and copy Oracle data to the Hadoop Distributed File System. It can be accessed either through a command-line interface or via Oracle SQL Developer. Data exported to the Hadoop cluster by Copy to Hadoop is stored in Oracle Data Pump format. This format optimizes queries through Big Data SQL:

• The data is stored as Oracle data types – eliminating data type conversions

• The data is queried directly – without requiring the overhead associated with Java SerDes

After generating Data Pump format files from the tables and copying the files to HDFS, you can use Apache Hive to query the data. Hive can process the data locally without accessing Oracle Database. When the Oracle table changes, you can refresh the copy in Hadoop. Copy to Hadoop is primarily useful for Oracle tables that are relatively static, and thus do not require frequent refreshes.
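Once the Hive table exists, the copied data can be queried without touching the Oracle database, for example (using the table name assumed in the sketch above):

# Hypothetical query against the Hive table created earlier.
hive -e "SELECT COUNT(*) FROM emp_hive;"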

Copy to Hadoop is licensed under Oracle Big Data SQL. You must have an Oracle Big Data SQL license in order to use the utility.

Stage 2 | Synchronize data from Oracle to Hadoop

SharePlex is used to replicate the data in near real time.

What is SharePlex?

SharePlex is the complete enterprise solution for Oracle database replication, including high availability, scalability, data integration and reporting use cases.

Why SharePlex?

• Redo-log-based database replication
• Reads the redo log for changes
• Ships the changes from source to target in near real time
• Requires that the source and target have identical data to start
• The initial backup/copy must be consistent to an SCN
• Maintains synchronization between the databases
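For orientation, SharePlex replication is driven by a configuration file that maps source objects to targets and is activated through the sp_ctrl command-line interface. The sketch below uses placeholder paths, schema names, and routing; the exact syntax, especially for non-Oracle targets such as Hadoop, should be taken from the SharePlex documentation:

# Rough sketch: create a SharePlex configuration file (all names and
# paths are placeholders; verify the routing syntax for your target).
cat > /u01/splex/vardir/config/oracle_to_target <<'EOF'
datasource:o.ORCL
SCOTT.EMP    SCOTT.EMP    targethost@o.TGTDB
EOF

# The configuration is then activated from the sp_ctrl interface, e.g.:
#   sp_ctrl> activate config oracle_to_target
#   sp_ctrl> status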

SharePlex architecture

Docker – Essential Commands

Command | Description
docker ps | List all running containers
docker ps -a | List all containers, stopped and running
docker stop container-id | Stop a running container
docker start container-id | Start a stopped container
docker restart container-id | Restart a running container
docker port container-id | List port mappings of a specific container
docker rm container-id or name | Remove a stopped container
docker rm -f container-id or name | Forcefully remove a running container
docker pull image-info | Pull an image from the Docker Hub repository
docker exec -it container-name /bin/sh | Connect to a Linux container and execute commands inside it
docker rmi image-id | Remove a Docker image
docker logout | Log out from Docker Hub
docker login -u username -p password | Log in to Docker Hub
docker stats | Display a live stream of container resource usage statistics
docker top container-id or name | Display the running processes of a container
docker version | Show the Docker version information

Setting the storage driver in Docker

Reference:

https://docs.docker.com/storage/storagedriver/select-storage-driver/

https://docs.docker.com/storage/storagedriver/

Linux distribution | Recommended storage drivers | Alternative drivers
Docker Engine – Community on Ubuntu | overlay2, or aufs for Ubuntu 14.04 on kernel 3.13 | overlay¹, devicemapper², zfs, vfs
Docker Engine – Community on Debian | overlay2 (Debian Stretch), aufs or devicemapper (older versions) | overlay¹, vfs
Docker Engine – Community on CentOS | overlay2 | overlay¹, devicemapper², zfs, vfs
Docker Engine – Community on Fedora | overlay2 | overlay¹, devicemapper², zfs, vfs

Get the current storage driver:

docker info

Set the storage driver explicitly using the daemon configuration file. This is the method that Docker recommends.

sudo vi /etc/docker/daemon.json

Add the storage driver details to the daemon configuration file:

{
  "storage-driver": "devicemapper"
}

Restart Docker after editing the file.

sudo systemctl restart docker
sudo systemctl status docker
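To confirm that the new driver is active after the restart, the storage driver can be read back from the daemon:

# Prints the storage driver currently in use (e.g. devicemapper).
docker info --format '{{.Driver}}'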

Installing Docker on CentOS

Install the required packages; these are prerequisites for installing Docker on CentOS:

sudo yum install -y device-mapper-persistent-data lvm2

Add the Docker CE repo:

sudo yum-config-manager \
    --add-repo \
    https://download.docker.com/linux/centos/docker-ce.repo

Install the Docker CE packages and containerd.io:

sudo yum install -y docker-ce-18.09.5 docker-ce-cli-18.09.5 containerd.io

Start and enable the Docker service:

sudo systemctl start docker
sudo systemctl enable docker

Add test_user to the docker group, giving the user permission to run docker commands:

sudo usermod -a -G docker test_user

Log out, log back in, and test the installation by running a simple container:

docker run hello-world