Category Archives: BigData

Install Apache Flink on Multi-node Cluster: RHE8

Preparation:

Set up a password less SSH connection between the nodes for easy communication

Setting up the cluster nodes:

Install the latest version of Java on all nodes in the cluster.

sudo yum install java-1.8.0-openjdk-devel

Install Apache ZooKeeper on all nodes in the cluster.

ZooKeeper is used for coordination between the nodes.

Install ZooKeeper 3.6.2:

wget https://downloads.apache.org/zookeeper/zookeeper-3.6.2/apache-zookeeper-3.6.2-bin.tar.gz 

tar -xvf apache-zookeeper-3.6.2-bin.tar.gz 

sudo mv apache-zookeeper-3.6.2-bin /usr/local/zookeeper

Installing Apache Flink:

Download the latest version of Apache Flink (1.16.1) from the official website:

wget https://mirrors.ocf.berkeley.edu/apache/flink/flink-1.16.1/flink-1.16.1-bin-scala_2.12.tgz

Unpack the archive to a directory on all nodes in the cluster:

tar -xvf flink-1.16.1-bin-scala_2.12.tgz sudo mv flink-1.16.1 /usr/local/flink

Configuring the Apache Flink cluster:

Create a copy of the flink-conf.yaml configuration file and customize it:

cd /usr/local/flink/conf 
cp flink-conf.yaml flink-conf.yaml.orig

Configure the jobmanager.rpc.address setting to the hostname or IP address of the master node.
Configure the taskmanager.numberOfTaskSlots setting to the number of parallel tasks that each task manager should run.
Configure the taskmanager.memory.process.size setting to the amount of memory that each task manager should use.

for example:

taskmanager.memory.process.size: 4GB
taskmanager.numberOfTaskSlots: 30

Configure the high-availability section to set up a high-availability setup using ZooKeeper:

The zoo.cfg file is the configuration file for ZooKeeper, which is used to set up a high-availability setup for Apache Flink.

The following details need to be added to this file:

Data Directory: Specify the directory where ZooKeeper will store its data.
Client Port: Specify the port that ZooKeeper will listen on for client connections.
Server List: Specify a list of servers in the ZooKeeper ensemble, including the hostname and client port for each server.
Tick Time: Specify the length of a single tick, which is the basic time unit used by ZooKeeper.
Init Limit: Specify the number of ticks that the initial synchronization phase between a ZooKeeper server and its followers can take.
Sync Limit: Specify the number of ticks that a follower can be behind a leader.
Snapshot Counter: Specify the number of transactions that can be processed before a snapshot of the ZooKeeper state is taken.

Here’s an example of a basic zoo.cfg file:

dataDir=/tmp/zookeeper 
clientPort=2181 

server.1=localhost:2888:3888 
server.2=localhost:2889:3889 
server.3=localhost:2890:3890 

tickTime=2000 
initLimit=10 
syncLimit=5 
snapCount=1000

Add the following lines to the flink-conf.yaml file

high-availability: zookeeper 
high-availability.zookeeper.quorum: host1:port,host2:port,host3:port 
high-availability.zookeeper.path.root: /flink:

Note: Replace host1:port, host2:port, and host3:port with the hostnames and ports of the ZooKeeper nodes in your cluster.

Starting the cluster:

Start ZooKeeper on all nodes in the cluster:

cd /usr/local/zookeeper/bin ./zkServer.sh start

Start the JobManager on the master node by running the following command:

cd /usr/local/flink/bin ./standalone-job.sh start

Start the TaskManagers on all other nodes by running the following command on each node:

cd /usr/local/flink/bin ./taskmanager.sh start

Here are the additional steps for setting up TLS/SSL/HTTPS :

Obtain a certificate
Install the certificate: Copy the certificate and private key files to a location on each node in the cluster. The location should be accessible to the user that runs the Flink process.
Install OpenSSL: If it’s not already installed, install the OpenSSL package on each node in the cluster. You can do this by running the following command:

sudo yum install openssl

Configure Flink: Modify the flink-conf.yaml file on each node to enable SSL/TLS and specify the location of the certificate and private key files. Here is an example configuration:

security.ssl.enabled: true 
security.ssl.certificate: /path/to/cert.pem 
security.ssl.private-key: /path/to/key.pem

Restart the nodes: After making the changes to the configuration file, restart the Job Manager and Task Manager nodes.

Verify the configuration: You can verify that the configuration is working by accessing the Flink web UI using an HTTPS URL (e.g. https://<jobmanager_host>:8081).The browser should show that the connection is secure and that the certificate was issued by a trusted CA.

Moving data from oracle to HDFS

1 Reply

I divided the scope into two stages:

Stage -1 | Initial data load from Oracle database to HDFS (for bulk load)

Here I used oracle native utility (Copy2hadoop) for reducing the friction

Stage -2 | Synchronize data from Oracle to Hadoop (delta changes)

Here I used SharePlex, comparatively effective third party tool (Quest/Dell) for data replication

Stage -1 | Initial data load from Oracle database to HDFS

Generally, we have two tools for data movement from the Oracle Database to HDFS

Sqoop and Copy2Hadoop :

Copy2Hadoop has advantages over sqoop, In my case we have complex query(Functions, PL/SQL, Joins) or even views (like a sub-case for complex queries). So, it is recommended to use Copy2Hadoop for the initial load. Since it is oracle native utility it will save our time and effort.

Copy2Hadoop creates datapump files that could have any Oracle datatype which is supported by Oracle External table and store data in Oracle’s datatype. When we convert datatype from Oracle format to the Java format (like sqoop does) there is always risk that some information would be converted incorrectly. Copy2Hadoop insures us from that.

Below are the high-level steps to copy the data:

Note:
Create Oracle Data Pump files from Oracle Database table data that we want to move.
We can do this by exporting the data from the Oracle Database table using the oracle_datapump
access driver (Using Oracle Data Pump files generated by other database utilities will
not be accessible by Hive tables. We need to use the oracle_datapump access driver to export the data.)

• Create Oracle Data Pump files and the ORACLE_DATAPUMP access driver.

• Copy the Oracle Data Pump files from source system to HDFS.

• Create Hive tables to read the Oracle Data Pump files.

What Is Copy to Hadoop?
Oracle Big Data SQL includes the Oracle Copy to Hadoop utility. This utility makes it simple to identify and copy Oracle data to the Hadoop Distributed File System. It can be accessed either through a command-line interface or via Oracle SQL Developer. Data exported to the Hadoop cluster by Copy to Hadoop is stored in Oracle Data Pump format. This format optimizes queries thru Big Data SQL:

• The data is stored as Oracle data types – eliminating data type conversions

• The data is queried directly – without requiring the overhead associated with Java SerDes

After generating Data Pump format files from the tables and copying the files to HDFS, you can use Apache Hive to query the data. Hive can process the data locally without accessing Oracle Database. When the Oracle table changes, you can refresh the copy in Hadoop. Copy to Hadoop is primarily useful for Oracle tables that are relatively static, and thus do not require frequent refreshes.

Copy to Hadoop is licensed under Oracle Big Data SQL. You must have an Oracle Big Data SQL license in order to use utility

Stage -2 | Synchronize data from Oracle to Hadoop

Using SharePlex to replicate the data in near real-time.

What is SharePlex ?

SharePlex is the complete enterprise solution for Oracle database replication, including high availability, scalability, data integration and reporting use cases.

Why SharePlex ?

• Redo Log Based Database Replication
• Read the redo log for changes
• Ship the changes from source to target in near-real time
• Requires source and target have identical data to start
• Backup/copy consistent to an SCN
• Maintain synchronization between databases

SharePlex architecture

Using Kafka Connect to Capture Data from a Relational Database (sqlite3)

Leave a reply

Use any Kafka docker images to install and start kafka.

reference:

https://docs.confluent.io/current/connect/userguide.html
https://github.com/bitnami/bitnami-docker-kafka
https://docs.confluent.io/3.1.1/connect/connect-jdbc/docs/sink_connector.html

JDBC driver download for SQLlite3:
https://bitbucket.org/xerial/sqlite-jdbc/downloads/

Start Kafka.

confluent start

Install SQLite3.

apt-get update
apt-get install sqlite3

Create a New Database and Populate It with a Table and Some Data
Create a new database called “test.db”.

root@shanoj_srv1:/# sqlite3 test.db

Create a new table in the SQLite database called “accounts”.

CREATE TABLE accounts (id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL, 
name VARCHAR (255));

Insert values into the table to begin populating it.

INSERT INTO accounts(name) VALUES('sabu');
INSERT INTO accounts(name) VALUES('ronnie');
.quit

Stop Kafka Connect.

confluent stop connect

Make necessary changes to below files:

root@shanoj_srv1:/# vi /etc/schema-registry/connect-avro-standalone.properties
bootstrap.servers=localhost:9092
key.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://localhost:8081
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081

# The internal converter used for offsets and config data is configurable and must be specified,
internal.key.converter=org.apache.kafka.connect.json.JsonConverter
internal.value.converter=org.apache.kafka.connect.json.JsonConverter
internal.key.converter.schemas.enable=false
internal.value.converter.schemas.enable=false

# Local storage file for offset data
offset.storage.file.filename=/tmp/connect.offsets

root@shanoj_srv1:/# vi etc/kafka-connect-jdbc/source-quickstart-sqlite.properties

# A simple example that copies all tables from a SQLite database. The first few settings are
# required for all connectors: a name, the connector class to run, and the maximum number of
# tasks to create:
name=test-source-sqlite-jdbc-autoincrement
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
tasks.max=1
# The remaining configs are specific to the JDBC source connector. In this example, we connect to a
# SQLite database stored in the file test.db, use and auto-incrementing column called 'id' to
# detect new rows as they are added, and output to topics prefixed with 'test-sqlite-jdbc-', e.g.
# a table called 'users' will be written to the topic 'test-sqlite-jdbc-users'.
connection.url=jdbc:sqlite:test.db
mode=incrementing
incrementing.column.name=id

Start Kafka Connect in standalone mode.

root@shanoj_srv1:/#connect-standalone -daemon /etc/schema-registry/connect-avro-standalone.properties /etc/kafka-connect-jdbc/source-quickstart-sqlite.properties

Verify that the connector was created.

root@shanoj_srv1:/# cat /logs/connectStandalone.out | grep -i "finished"
[2019-08-15 15:45:49,421] INFO Finished creating connector test-source-sqlite-jdbc-autoincrement (org.apache.kafka.connect.runtime.Worker:225)
[2019-08-15 15:45:49,504] INFO Source task WorkerSourceTask{id=test-source-sqlite-jdbc-autoincrement-0} finished initialization and start (org.apache.kafka.connect.runtime.WorkerSourceTask:143)
[2019-08-15 15:46:49,484] INFO Finished WorkerSourceTask{id=test-source-sqlite-jdbc-autoincrement-0} commitOffsets successfully in 6 ms (org.apache.kafka.connect.runtime.WorkerSourceTask:373)
root@shanoj_srv1:/# curl -s localhost:8083/connectors

Examine the Kafka topic created.

root@shanoj_srv1:/# kafka-topics --list --zookeeper localhost:2181 | grep test-sqlite-jdbc
test-sqlite-jdbc-accounts

Start a Kafka Consumer and Write New Data to the Database

Open a Kafka consumer.

root@shanoj_srv1:/# kafka-avro-console-consumer --new-consumer --bootstrap-server localhost:9092 --topic test-sqlite-jdbc-accounts --from-beginning

Open a new tab to a new terminal session.
Open a new shell in this session.

root@shanoj_srv1:/# sudo docker exec -it sqlite-test //bin//bash

Transfer to the tmp directory.

root@shanoj_srv1:/# cd /tmp

Access the SQLite database test.db.

root@shanoj_srv1:/# sqlite3 test.db

Insert a new value into the accounts table.

root@shanoj_srv1:/tmp# sqlite3 test.db
SQLite version 3.8.7.1 2014-10-29 13:59:56
Enter ".help" for usage hints.
sqlite> INSERT INTO accounts(name) VALUES('rama');
sqlite> INSERT INTO accounts(name) VALUES('lev');
sqlite> INSERT INTO accounts(name) VALUES('sriram');
sqlite> INSERT INTO accounts(name) VALUES('joby');
sqlite> INSERT INTO accounts(name) VALUES('shanoj');
sqlite>

Return to the previous session with the consumer and verify the data has been written.

root@ip-10-0-1-100:/# kafka-avro-console-consumer --new-consumer --bootstrap-server localhost:9092 --topic test-sqlite-jdbc-accounts --from-beginning
{"id":3,"name":{"string":"rama"}}
{"id":4,"name":{"string":"lev"}}
{"id":5,"name":{"string":"sriram"}}
{"id":6,"name":{"string":"joby"}}
{"id":7,"name":{"string":"shanoj"}}

Install and Configure PostgreSQL 9.x: RHEL/CentOS

Leave a reply

1.Download and install it using the appropriate package management

~ $ rpm -Uvh https://yum.postgresql.org/9.4/redhat/rhel-7-x86_64/pgdg-centos94-9.4-3.noarch.rpm
Retrieving https://yum.postgresql.org/9.4/redhat/rhel-7-x86_64/pgdg-centos94-9.4-3.noarch.rpm
warning: /var/tmp/rpm-tmp.IZow7N: Header V4 DSA/SHA1 Signature, key ID 442df0f8: NOKEY
Preparing... ################################# [100%]
Updating / installing...
1:pgdg-redhat-repo-42.0-4 ################################# [100%]

2. Applying any necessary updates.

[root@tcox6 ~]# yum update

3. Install the PostgreSQL 9.4 server and associated contribed modules and utilities. Once installed, run the database initialization routine before starting the database.

[root@tcox6 ~]# yum install postgresql94-server postgresql94-contrib

4. Enable the PostgreSQL 9.4 server to run on system start and then start the database server.

[root@tcox6 ~]# systemctl enable postgresql-9.4

ln -s '/usr/lib/systemd/system/postgresql-9.4.service' '/etc/systemd/system/multi-user.target.wants/postgresql-9.4.service'

[root@tcox6 ~]# systemctl start postgresql-9.4

5. Check to see if SELinux is being run in enforced mode on your system. If so, run the command to allow external HTTP DB connections to the server through SELinux configuration.

# cat /etc/selinux/config

# This file controls the state of SELinux on the system.

# SELINUX= can take one of these three values:

# enforcing - SELinux security policy is enforced.

# permissive - SELinux prints warnings instead of enforcing.

# disabled - No SELinux policy is loaded.

SELINUX=enforcing

# SELINUXTYPE= can take one of these two values:

# targeted - Targeted processes are protected,

# minimum - Modification of targeted policy. Only selected processes are protected.

# mls - Multi Level Security protection.

SELINUXTYPE=targeted

# setsebool -P httpd_can_network_connect_db 1

6. Login to the ‘postgres’ user and run the ‘psql’ command. Once at the database prompt, set a password for the ‘psql’ user.

[root@tcox6 ~]# su - postgres

Last login: Wed Sep 2 13:35:21 UTC 2015 on pts/0

-bash-4.2$ psql

psql (9.4.4)

Type "help" for help.

postgres=# \password postgres

Enter new password:

Enter it again:

postgres=# quit

postgres-# \q

-bash-4.2$ exit

logout

Hadoop Installation with Single DataNode on Ubuntu (VirtualBox)

Leave a reply

Download link!

The above document contains :

01) Updating the Ubuntu repository.
02) Installing openjdk-6-jdk.
03) Installing openssh-server.
04) Downloading and installing Hadoop.
05) Configuring Hadoop enveronment.
07) Create a ssh key.
08) Format the name node
09) Start the namenode, datanode
10) Start the task tracker and job tracker.
11) Checking if Hadoop started correctly.

Shanoj

Learn.Share.Grow

Category Archives: BigData

Install Apache Flink on Multi-node Cluster: RHE8

Preparation:

Setting up the cluster nodes:

Starting the cluster:

Here are the additional steps for setting up TLS/SSL/HTTPS :

Moving data from oracle to HDFS

Using Kafka Connect to Capture Data from a Relational Database (sqlite3)

Install and Configure PostgreSQL 9.x: RHEL/CentOS

Hadoop Installation with Single DataNode on Ubuntu (VirtualBox)