
Moving data from Oracle to HDFS

I divided the scope into two stages:

Stage 1 | Initial data load from the Oracle database to HDFS (bulk load)

  • Here I used Oracle's native utility, Copy to Hadoop, to reduce friction

Stage 2 | Synchronize data from Oracle to Hadoop (delta changes)

  • Here I used SharePlex, a third-party data replication tool from Quest (formerly part of Dell)

Stage 1 | Initial data load from the Oracle database to HDFS

Generally, we have two tools for moving data from an Oracle database to HDFS: Sqoop and Copy to Hadoop.

Copy to Hadoop has advantages over Sqoop in my case: the source involves complex queries (functions, PL/SQL, joins) and views (a sub-case of complex queries), so Copy to Hadoop is the recommended choice for the initial load. Since it is an Oracle-native utility, it saves time and effort.

Copy to Hadoop creates Data Pump files that can contain any Oracle datatype supported by Oracle external tables, and it stores the data in Oracle's own datatypes. Whenever datatypes are converted from Oracle format to Java format (as Sqoop does), there is a risk that some information is converted incorrectly. Copy to Hadoop protects us from that.

Below are the high-level steps to copy the data:

Note:
Create Oracle Data Pump files from the Oracle Database table data that we want to move.
We do this by exporting the data with the ORACLE_DATAPUMP access driver. (Oracle Data
Pump files generated by other database utilities are not readable by Hive tables;
the files must be produced by the ORACLE_DATAPUMP access driver.)
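As a sketch of that export step, assuming a source table SCOTT.EMP, a server filesystem path /u01/app/oracle/dp_exp, and a directory object DP_EXP_DIR (all names illustrative), the Data Pump files can be created with a CTAS through the access driver:

```sql
-- Directory object pointing at a filesystem path on the database server
-- (the path and all object names here are illustrative).
CREATE DIRECTORY dp_exp_dir AS '/u01/app/oracle/dp_exp';

-- Export the table data through the ORACLE_DATAPUMP access driver.
-- The CTAS writes emp_dp.dmp in Data Pump format, which Hive can
-- later read via Copy to Hadoop's SerDe.
CREATE TABLE emp_dp
  ORGANIZATION EXTERNAL (
    TYPE ORACLE_DATAPUMP
    DEFAULT DIRECTORY dp_exp_dir
    LOCATION ('emp_dp.dmp')
  )
AS SELECT * FROM scott.emp;
```

The SELECT at the end is where a complex query, join, or view can be plugged in, which is exactly why this route suits the complex-query case described above.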

• Create Oracle Data Pump files using the ORACLE_DATAPUMP access driver.

• Copy the Oracle Data Pump files from source system to HDFS.

• Create Hive tables to read the Oracle Data Pump files.
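The copy step in the middle is a plain HDFS put; the local and HDFS paths below are illustrative:

```shell
# Create a target directory in HDFS and copy the Data Pump file(s) there.
hdfs dfs -mkdir -p /user/oracle/emp_dp
hdfs dfs -put /u01/app/oracle/dp_exp/emp_dp.dmp /user/oracle/emp_dp/
```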

What Is Copy to Hadoop?
Oracle Big Data SQL includes the Oracle Copy to Hadoop utility. This utility makes it simple to identify and copy Oracle data to the Hadoop Distributed File System. It can be accessed either through a command-line interface or via Oracle SQL Developer. Data exported to the Hadoop cluster by Copy to Hadoop is stored in Oracle Data Pump format. This format optimizes queries through Big Data SQL:

• The data is stored as Oracle data types – eliminating data type conversions

• The data is queried directly – without requiring the overhead associated with Java SerDes

After generating Data Pump format files from the tables and copying the files to HDFS, you can use Apache Hive to query the data. Hive can process the data locally without accessing Oracle Database. When the Oracle table changes, you can refresh the copy in Hadoop. Copy to Hadoop is primarily useful for Oracle tables that are relatively static, and thus do not require frequent refreshes.
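A minimal Hive DDL for reading the copied files might look like the following; the table name and HDFS location are illustrative, while the SerDe and InputFormat classes are the ones shipped with Copy to Hadoop:

```sql
-- No column list is needed: the Data Pump files carry the Oracle table
-- metadata, and the DPSerDe derives the Hive schema from it.
CREATE EXTERNAL TABLE emp_dp
ROW FORMAT SERDE 'oracle.hadoop.hive.datapump.DPSerDe'
STORED AS
  INPUTFORMAT  'oracle.hadoop.hive.datapump.DPInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/user/oracle/emp_dp';
```

Once the table exists, Hive queries run against the HDFS copy without touching the Oracle database.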

Copy to Hadoop is licensed under Oracle Big Data SQL. You must have an Oracle Big Data SQL license in order to use the utility.

Stage 2 | Synchronize data from Oracle to Hadoop

We use SharePlex to replicate the data in near real time.

What is SharePlex?

SharePlex is the complete enterprise solution for Oracle database replication, including high availability, scalability, data integration and reporting use cases.

Why SharePlex?

• Redo-log-based database replication
• Reads the redo logs for changes
• Ships the changes from source to target in near real time
• Requires that source and target start with identical data
• The initial backup/copy must be consistent to an SCN
• Maintains synchronization between the databases
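For a flavor of the setup, a SharePlex configuration file maps source tables to targets and a route. The sketch below is purely illustrative: the SID, table, host name, and routing are assumptions, and the exact open-target routing for an HDFS/Hive landing zone depends on the SharePlex version and post type in use:

```
datasource:o.ORCL

# source table    target table    routing (host@target-type)
scott.emp         scott.emp       hdpnode@x.file
```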

SharePlex architecture