
This article offers a detailed exploration of the design and implementation of bank reconciliation systems within an Online Transaction Processing (OLTP) environment and their integration with Online Analytical Processing (OLAP) systems for enhanced reporting. It traces the progression from transactional processing to analytical reporting, illustrated with schema designs and practical examples.

Dimensional Modeling is a data warehousing design approach that transforms complex databases into understandable schemas. It structures data using facts and dimensions, facilitating the creation of data cubes that enable sophisticated analytical queries for business intelligence and data analytics applications. This method ensures rapid data retrieval and aids in making informed decisions based on comprehensive data insights.
“Dimensional modeling is explicitly designed to address the need of users to understand the data easily and to analyze it rapidly.” — Ralph Kimball, “The Data Warehouse Toolkit”

OLTP System Schema
The Online Transaction Processing (OLTP) system is the backbone for capturing real-time banking transactions. It’s designed for high transactional volume, ensuring data integrity and quick response times.
Let’s use a fictitious use case to explain the concepts and data modeling.
Core Tables Overview
FinancialInstitutions: Holds details about banks, like identifiers and addresses.
FinancialTransactions: Records each transaction needing reconciliation.
LifecycleAssociations: Tracks the transaction’s lifecycle stages.
AuditLogs: Logs all audit-related actions and changes.
Table Functions
FinancialInstitutions: Identifies banks involved in transactions.
FinancialTransactions: Central repository for transaction data.
LifecycleAssociations: Manages transaction progress through different stages.
AuditLogs: Ensures traceability and compliance through logging actions.
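For concreteness, here is a minimal DDL sketch of these four tables. The column names, types, and keys are illustrative assumptions based on the descriptions above, not a prescribed schema.
-- Illustrative OLTP schema sketch; columns, types, and keys are assumptions
CREATE TABLE FinancialInstitutions (
    bank_id          BIGINT PRIMARY KEY,
    name             VARCHAR(255) NOT NULL,
    address          VARCHAR(500)
);
CREATE TABLE FinancialTransactions (
    transaction_id   BIGINT PRIMARY KEY,
    bank_id          BIGINT NOT NULL REFERENCES FinancialInstitutions(bank_id),
    amount           DECIMAL(18,2) NOT NULL,
    transaction_date DATE NOT NULL
);
CREATE TABLE LifecycleAssociations (
    association_id   BIGINT PRIMARY KEY,
    transaction_id   BIGINT NOT NULL REFERENCES FinancialTransactions(transaction_id),
    lifecycle_stage  VARCHAR(50) NOT NULL,   -- e.g., 'pending', 'processed', 'reconciled'
    stage_entered_at TIMESTAMP NOT NULL
);
CREATE TABLE AuditLogs (
    audit_id         BIGINT PRIMARY KEY,
    transaction_id   BIGINT NOT NULL REFERENCES FinancialTransactions(transaction_id),
    audit_action     VARCHAR(100) NOT NULL,
    audit_timestamp  TIMESTAMP NOT NULL
);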

“In an OLTP system, the speed of transaction processing is often the key to business competitiveness.” — James Serra, in his blog on Data Warehousing
Transitioning to OLAP for Reporting
Transitioning to an Online Analytical Processing (OLAP) system involves denormalizing OLTP data to optimize for read-heavy analytical queries.
Star Schema Design for Enhanced Reporting
A star schema further refines the data structure for efficient querying, centering on FactTransactionAudit, which connects to the dimension tables:
DimBank
DimTransactionDetails
DimLifecycleStage
DimAuditTrail
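As an illustration, a dimension such as DimBank carries the descriptive bank attributes that analytical queries filter and group by. The columns below are assumptions carried over from the OLTP sketch above.
-- Illustrative dimension sketch; attribute names are assumptions
CREATE TABLE DimBank (
    bank_id  BIGINT PRIMARY KEY,   -- conformed key referenced by the fact table
    name     VARCHAR(255),
    address  VARCHAR(500)
);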

Denormalized OLAP Schema Structure Explanation:
transaction_id: Serves as the primary key of the table, uniquely identifying each transaction in the dataset.
bank_id: Acts as a foreign key linking to the DimBank dimension table, which contains detailed information about each bank.
amount: Records the monetary value of the transaction.
transaction_date: Marks the date when the transaction occurred, useful for time-based analyses and reporting.
audit_id: A foreign key that references the DimAuditTrail dimension table, providing a link to audit information related to the transaction.
lifecycle_stage: Describes the current stage of the transaction within its lifecycle, such as “pending,” “processed,” or “reconciled,” which could be linked to the DimLifecycleStage dimension table for detailed descriptions of each stage.
is_auto_matched: A boolean flag indicating whether the transaction was matched automatically (an “auto match”), requiring no manual intervention. This is crucial for reporting on and analyzing the efficiency of the reconciliation process.
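Taken together, the fact table might be defined roughly as follows; the types are assumptions, and the keys point at the dimension tables listed earlier.
-- Fact table sketch based on the fields described above; types are assumptions
CREATE TABLE FactTransactionAudit (
    transaction_id   BIGINT PRIMARY KEY,
    bank_id          BIGINT,          -- FK to DimBank
    amount           DECIMAL(18,2),
    transaction_date DATE,
    audit_id         BIGINT,          -- FK to DimAuditTrail
    lifecycle_stage  VARCHAR(50),     -- optionally a key into DimLifecycleStage
    is_auto_matched  BOOLEAN
);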
“The objective of the data warehouse is to provide a coherent picture of the business at a point in time.” — Bill Inmon, “Building the Data Warehouse”
Reporting Use Cases: Auto Match Reporting
Identify transactions reconciled automatically. This requires joining the FactTransactionAudit table with dimension tables to filter auto-matched transactions.

Access Path for Auto Match Transactions
SELECT
    dt.transaction_id,
    db.name AS bank_name
    -- Additional fields
FROM
    FactTransactionAudit fta
JOIN
    DimTransactionDetails dt ON fta.FK_transaction_id = dt.transaction_id
JOIN
    DimBank db ON fta.FK_bank_id = db.bank_id  -- assumed key name, mirroring FK_transaction_id
    -- Additional joins
WHERE
    fta.is_auto_matched = TRUE;
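Beyond listing individual auto-matched transactions, the same flag supports efficiency reporting, such as the share of transactions auto-matched per bank. A hedged sketch, reusing the column and key names assumed above (including FK_bank_id):
-- Auto-match rate per bank; column and key names follow the sketches above
SELECT
    db.name AS bank_name,
    COUNT(*) AS total_transactions,
    SUM(CASE WHEN fta.is_auto_matched THEN 1 ELSE 0 END) AS auto_matched,
    ROUND(100.0 * SUM(CASE WHEN fta.is_auto_matched THEN 1 ELSE 0 END) / COUNT(*), 2) AS auto_match_pct
FROM
    FactTransactionAudit fta
JOIN
    DimBank db ON fta.FK_bank_id = db.bank_id
GROUP BY
    db.name
ORDER BY
    auto_match_pct DESC;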
Partitioning Strategy
Partitioning helps manage large datasets by dividing them into more manageable parts based on certain fields, often improving query performance by allowing systems to read only relevant partitions.
Suggested Partitioning Scheme:
For the ConsolidatedTransactions table in an OLAP setting like Hive:
By Date: Partitioning by transaction date (e.g., transaction_date) is a natural choice for financial data, allowing efficient queries over specific periods.
Structure: /year=YYYY/month=MM/day=DD/.
By Bank: If analyses often filter by specific banks, adding a secondary level of partitioning by bank_id can further optimize access patterns.
Structure: /year=YYYY/month=MM/bank_id=XYZ/.
CREATE TABLE ConsolidatedTransactions (
    transaction_id STRING,
    bank_name STRING,
    transaction_description STRING,
    transaction_amount DECIMAL(18,2),  -- exact type; FLOAT is best avoided for monetary values
    lifecycle_description STRING,
    audit_action STRING,
    audit_timestamp TIMESTAMP
)
PARTITIONED BY (year STRING, month STRING, day STRING, bank_id STRING)
STORED AS PARQUET;
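One possible way to populate this table is a flattening INSERT ... SELECT over the star schema, deriving the partition columns from transaction_date. The dimension-side column names below are assumptions, and dynamic partitioning must be enabled for the fully dynamic partition spec.
-- Illustrative load; dimension-side column names are assumptions
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE ConsolidatedTransactions
PARTITION (year, month, day, bank_id)
SELECT
    CAST(fta.transaction_id AS STRING)        AS transaction_id,
    db.name                                   AS bank_name,
    dt.transaction_description               AS transaction_description,
    fta.amount                                AS transaction_amount,
    dls.lifecycle_description                 AS lifecycle_description,
    da.audit_action                           AS audit_action,
    da.audit_timestamp                        AS audit_timestamp,
    date_format(fta.transaction_date, 'yyyy') AS year,
    date_format(fta.transaction_date, 'MM')   AS month,
    date_format(fta.transaction_date, 'dd')   AS day,
    CAST(fta.bank_id AS STRING)               AS bank_id
FROM FactTransactionAudit fta
JOIN DimBank db               ON fta.bank_id = db.bank_id
JOIN DimTransactionDetails dt ON fta.transaction_id = dt.transaction_id
JOIN DimLifecycleStage dls    ON fta.lifecycle_stage = dls.lifecycle_stage
JOIN DimAuditTrail da         ON fta.audit_id = da.audit_id;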
Optimal Partitioning Strategy
Processing billions of transactions every month can result in generating too many small files if daily partitioning is used. However, if monthly partitioning is applied, it may lead to the creation of very large files that are inefficient to process. To strike a balance between the two, a weekly or bi-weekly partitioning scheme can be adopted. This approach can reduce the overall number of files generated and keep the file sizes manageable.
from pyspark.sql.functions import col, weekofyear, year
# Assuming df is your DataFrame loaded with the transaction data
df = df.withColumn("year", year(col("transaction_date"))) \
       .withColumn("year_week", weekofyear(col("transaction_date")))
“In large databases, partitioning is critical for both performance and manageability.” — C.J. Date, “An Introduction to Database Systems”
Efficient Data Writes
Given the data volume, using repartition based on the partitioning scheme before writing can help distribute the data more evenly across the partitions, especially if the transactions are not uniformly distributed across the period.
Writing Data with Adaptive Repartitioning
Considering the volume, dynamic repartitioning based on the data characteristics of each write operation is necessary.
# Dynamically repartition based on the number of records
# Aim for partitions with around 50 million to 100 million records each
num_partitions = df.count() // 50000000
if num_partitions < 1:
    num_partitions = 1  # Ensure at least one partition

df.repartition(num_partitions, "year", "year_week", "bank_id") \
    .write \
    .mode("overwrite") \
    .partitionBy("year", "year_week", "bank_id") \
    .format("parquet") \
    .saveAsTable("ConsolidatedTransactions")
Dynamic Partitioning for Incremental Loads
For incremental loads, enabling dynamic partitioning in Spark is crucial to ensure that only the relevant partitions are updated without scanning or rewriting the entire dataset.
# Enable dynamic partitioning
spark.conf.set("hive.exec.dynamic.partition", "true")
spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")
# Write incremental data
df.write \
    .mode("append") \
    .partitionBy("year", "year_week", "bank_id") \
    .format("parquet") \
    .saveAsTable("ConsolidatedTransactions")
Other Considerations
Managing Small Files Issue
Small files can degrade performance in Hadoop ecosystems by overwhelming the NameNode with metadata operations and causing excessive overhead during processing.
Solutions:
Compaction Jobs: Regularly run compaction jobs to merge small files into larger ones within each partition; Spark can be used to read in small files and coalesce them into a smaller number of larger files. A HiveQL compaction sketch follows below.
Spark Repartition/Coalesce: When writing data out from Spark, use repartition to increase the number of partitions (and thus files) if needed, or coalesce to reduce them, depending on your use case.
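For the compaction-job option, a common pattern in Hive is to rewrite a partition onto itself so that its many small files are re-emitted as fewer, larger ones. A minimal HiveQL sketch; the partition values are examples, and which hive.merge.* flag applies depends on the execution engine (MapReduce vs. Tez):
-- Rewrite one partition in place to merge its small files; partition values are examples
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
SET hive.merge.smallfiles.avgsize=128000000;  -- trigger a merge pass when average file size falls below ~128 MB
INSERT OVERWRITE TABLE ConsolidatedTransactions
PARTITION (year='2024', month='01', day='15', bank_id='XYZ')
SELECT
    transaction_id, bank_name, transaction_description, transaction_amount,
    lifecycle_description, audit_action, audit_timestamp
FROM ConsolidatedTransactions
WHERE year='2024' AND month='01' AND day='15' AND bank_id='XYZ';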
Writing Data with Coalesce to Avoid Small Files:
# Assuming df is your DataFrame loaded with the data ready to be written
# Adjust the coalesce target below based on your specific needs
df.coalesce(10) \
    .write \
    .mode("overwrite") \
    .partitionBy("year", "month", "day", "bank_id") \
    .format("parquet") \
    .saveAsTable("ConsolidatedTransactions")
Using Repartition for Better File Distribution
# Adjust the partition column names and number of partitions as needed
# Adjust the partition column names and number of partitions as needed
df.repartition(10, "year", "month", "day", "bank_id") \
    .write \
    .mode("overwrite") \
    .partitionBy("year", "month", "day", "bank_id") \
    .format("parquet") \
    .saveAsTable("ConsolidatedTransactions")
Efficiently handling this data volume requires monitoring and tuning Spark’s execution parameters and memory.
- Cluster Capacity: Ensure the underlying hardware and cluster resources are scaled appropriately to handle the data volume and processing needs.
- Archival Strategy: Implement a data archival or purging strategy for older data that is no longer actively queried to manage overall storage requirements.
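For the archival strategy, partitioned Hive tables make age-based purging straightforward: export or copy old partitions to cheaper storage if they must be retained, then drop them from the table. A sketch, with an example cutoff year:
-- Drop aged-out partitions after archiving them elsewhere; the cutoff year is an example
ALTER TABLE ConsolidatedTransactions DROP IF EXISTS PARTITION (year='2021');
-- Hive also accepts comparison operators in the partition spec:
ALTER TABLE ConsolidatedTransactions DROP IF EXISTS PARTITION (year<'2022');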
To manage billions of transactions every month, it’s necessary to plan and optimize data storage and processing strategies. You can significantly improve the performance and scalability of your big data system by utilizing partitioning schemes that balance the file size with the number of files and by dynamically adjusting to the data volume during writes.
