AWS-Powered Banking: Automating Reconciliation with Cloud Efficiency

This article explains how a Bank Reconciliation System is structured on AWS, with the aim of processing and reconciling banking transactions. The system automates the matching of transactions from batch feeds and provides a user interface for manually reconciling any open items.

Architecture Overview

The BRS (Bank Reconciliation System) is engineered to support high-volume transaction processing with an emphasis on automation, accuracy, and user engagement for manual interventions. The system incorporates AWS cloud services to ensure scalability, availability, and security.

Technical Flow

  1. Batch Feed Ingestion: Transaction files, referred to as “left” and “right” feeds, are exported from an on-premises data center into the AWS environment.
  2. Storage and Processing: Files are stored in an S3 bucket, triggering AWS Lambda functions.
  3. Automated Reconciliation: Lambda functions process the batch feeds to perform automated matching of transactions. Matched transactions are termed “auto-match”; a minimal handler sketch follows this list.
  4. Database Storage: Both the auto-matched transactions and the unmatched transactions, known as “open items,” are stored in an Amazon Aurora database.
  5. Application Layer: A backend application, developed with Spring Boot, interacts with the database to retrieve and manage transaction data.
  6. User Interface: An Angular front-end application presents the open items to application users (bank employees) for manual reconciliation.
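
To make steps 2–4 concrete, the sketch below shows how an S3-triggered Lambda handler might look in Python. The bucket layout, file format (CSV), and function names are illustrative assumptions rather than details taken from the actual system.

# Minimal sketch of an S3-triggered Lambda handler (names and file layout are assumptions)
import csv
import io

import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    """Invoked by an S3 ObjectCreated notification for a "left" or "right" feed file."""
    parsed_rows = 0
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]  # e.g. feeds/left/2024-04-15.csv (assumed layout)

        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
        rows = list(csv.DictReader(io.StringIO(body)))
        parsed_rows += len(rows)
        # The auto-match and Aurora writes described in steps 3-4 would follow here.
    return {"parsed_rows": parsed_rows}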

System Components

  • AWS S3: Initial repository for batch feeds. Its event-driven capabilities trigger processing via Lambda.
  • AWS Lambda: The serverless compute layer that processes batch feeds and performs auto-reconciliation.
  • Amazon Aurora: A MySQL and PostgreSQL compatible relational database used to store both auto-matched and open transactions.
  • Spring Boot: Provides the backend services that facilitate the retrieval and management of transaction data for the front-end application.
  • Angular: The front-end framework used to build the user interface for the manual reconciliation process.

System Interaction

  1. Ingestion: Batch feeds from the on-premises data center are uploaded to AWS S3.
  2. Triggering Lambda: S3 events upon file upload automatically invoke Lambda functions dedicated to processing these feeds.
  3. Processing: Lambda functions parse the batch feeds, automatically reconcile transactions where possible, and identify open items for manual reconciliation (a simplified matching sketch follows this list).
  4. Storing Results: Lambda functions store the outcomes in the Aurora database, segregating auto-matched and open items.
  5. User Engagement: The Spring Boot application provides an API for the Angular front-end, through which bank employees access and work on open items.
  6. Manual Reconciliation: Users perform manual reconciliations via the Angular application, which updates the status of transactions within the Aurora database accordingly.
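
The matching step itself can be as simple as pairing rows whose key fields agree. The sketch below pairs “left” and “right” rows on a reference and amount; the real matching keys and tolerance rules are not documented here, so treat them as assumptions.

# Simplified auto-match sketch: pair "left" and "right" rows on (reference, amount).
# The actual matching keys and tolerance rules are assumptions for illustration.
from collections import defaultdict

def auto_match(left_rows, right_rows):
    """Return (matched_pairs, open_items) for two lists of transaction dicts."""
    index = defaultdict(list)
    for row in right_rows:
        index[(row["reference"], row["amount"])].append(row)

    matched, open_items = [], []
    for row in left_rows:
        candidates = index[(row["reference"], row["amount"])]
        if candidates:
            matched.append((row, candidates.pop()))  # auto-match
        else:
            open_items.append(row)  # left-side open item for manual reconciliation

    # Right-side rows that were never paired are open items as well.
    open_items.extend(r for rows in index.values() for r in rows)
    return matched, open_items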

Security and Compliance

  • Data Encryption: All data is encrypted in transit and at rest using AWS security services.
  • Identity Management: Amazon Cognito ensures secure user authentication for application access.
  • Web Application Firewall: AWS WAF protects against common web threats and vulnerabilities.

Monitoring and Reliability

  • CloudWatch: Monitors the system, logs all events, and raises alerts on anomalies (an example alarm is sketched after this list).
  • High Availability: The system spans multiple Availability Zones for resilience and employs Elastic Load Balancing for traffic distribution.
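
As an example of the alerting side, the sketch below creates a CloudWatch alarm on errors from the reconciliation Lambda. The alarm name, function name, and SNS topic ARN are placeholders, not values from the real system.

# Illustrative CloudWatch alarm on errors from the (assumed) reconciliation Lambda
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="brs-reconciliation-lambda-errors",  # placeholder name
    Namespace="AWS/Lambda",
    MetricName="Errors",
    Dimensions=[{"Name": "FunctionName", "Value": "brs-reconcile-feeds"}],  # placeholder function
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=["arn:aws:sns:eu-west-1:123456789012:brs-alerts"],  # placeholder SNS topic
)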

Scalability

  • Elastic Beanstalk & EKS: Both services can scale the compute resources automatically in response to the load, ensuring that the BRS can handle peak volumes efficiently.

Note: When you deploy an application with Elastic Beanstalk in a load-balanced environment, it automatically provisions an Elastic Load Balancer in front of the EC2 instances running your application, distributing incoming traffic across those instances for load balancing and fault tolerance.

Cost Optimization

  • S3 Intelligent-Tiering: Manages storage costs by automatically moving less frequently accessed data to lower-cost tiers.
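
One way to opt the batch-feed objects into Intelligent-Tiering is a lifecycle rule on the feed bucket, sketched below; the bucket name, key prefix, and transition age are assumptions.

# Illustrative lifecycle rule moving batch-feed objects into S3 Intelligent-Tiering
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="brs-batch-feeds",  # placeholder bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "feeds-to-intelligent-tiering",
                "Status": "Enabled",
                "Filter": {"Prefix": "feeds/"},  # placeholder key prefix
                "Transitions": [{"Days": 30, "StorageClass": "INTELLIGENT_TIERING"}],
            }
        ]
    },
)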

DevOps Practices

  • CodeCommit & ECR: Source code management and container image repository are handled via AWS CodeCommit and ECR, respectively, streamlining the CI/CD pipeline.

The BRS leverages AWS services to create a seamless, automated reconciliation process, complemented by an intuitive user interface for manual intervention, ensuring a robust solution for the bank’s reconciliation needs.

Data Modelling for Data Warehouses: Bridging OLTP and OLAP for Advanced Data Analytics

This article offers a detailed exploration of the design and implementation of bank reconciliation systems within an Online Transaction Processing (OLTP) environment and their integration with Online Analytical Processing (OLAP) systems for enhanced reporting. It navigates through the progression from transactional processing to analytical reporting, including schema designs and practical examples.

Dimensional Modeling is a data warehousing design approach that transforms complex databases into understandable schemas. It structures data using facts and dimensions, facilitating the creation of data cubes that enable sophisticated analytical queries for business intelligence and data analytics applications. This method ensures rapid data retrieval and aids in making informed decisions based on comprehensive data insights.

“Dimensional modeling is explicitly designed to address the need of users to understand the data easily and to analyze it rapidly.” — Ralph Kimball, “The Data Warehouse Toolkit”

OLTP System Schema

The Online Transaction Processing (OLTP) system is the backbone for capturing real-time banking transactions. It’s designed for high transactional volume, ensuring data integrity and quick response times.

Let’s use a fictitious use case to explain concepts and data modelling.

Core Tables Overview

  1. FinancialInstitutions: Holds details about banks, like identifiers and addresses.
  2. FinancialTransactions: Records each transaction needing reconciliation.
  3. LifecycleAssociations: Tracks the transaction’s lifecycle stages.
  4. AuditLogs: Logs all audit-related actions and changes.

Table Functions

  • FinancialInstitutions: Identifies banks involved in transactions.
  • FinancialTransactions: Central repository for transaction data.
  • LifecycleAssociations: Manages transaction progress through different stages.
  • AuditLogs: Ensures traceability and compliance through logging actions.
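
To ground the overview above, here is a minimal, illustrative DDL for two of the core tables. SQLite is used only so the sketch is self-contained and runnable; the production schema would use Aurora (MySQL/PostgreSQL) types and constraints, and the column names are assumptions.

# Illustrative OLTP-style DDL for two core tables (column names are assumptions).
# SQLite keeps the sketch self-contained; the real system would target Aurora.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE FinancialInstitutions (
    bank_id   INTEGER PRIMARY KEY,
    name      TEXT NOT NULL,
    address   TEXT
);

CREATE TABLE FinancialTransactions (
    transaction_id    INTEGER PRIMARY KEY,
    bank_id           INTEGER NOT NULL REFERENCES FinancialInstitutions (bank_id),
    amount            NUMERIC NOT NULL,
    transaction_date  TEXT NOT NULL,
    status            TEXT NOT NULL DEFAULT 'open'  -- e.g. open / auto-matched / reconciled
);
""")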

“In an OLTP system, the speed of transaction processing is often the key to business competitiveness.” — James Serra, in his blog on Data Warehousing

Transitioning to OLAP for Reporting

Transitioning to an Online Analytical Processing (OLAP) system involves denormalizing OLTP data to optimize for read-heavy analytical queries.

Star Schema Design for Enhanced Reporting

A star schema further refines the data structure for efficient querying, centred on FactTransactionAudit and connected to the following dimension tables:

  • DimBank
  • DimTransactionDetails
  • DimLifecycleStage
  • DimAuditTrail

Denormalized OLAP Schema Structure Explanation:

transaction_id: Serves as the primary key of the table, uniquely identifying each transaction in the dataset.

bank_id: Acts as a foreign key linking to the DimBank dimension table, which contains detailed information about each bank.

amount: Records the monetary value of the transaction.

transaction_date: Marks the date when the transaction occurred, useful for time-based analyses and reporting.

audit_id: A foreign key that references the DimAuditTrail dimension table, providing a link to audit information related to the transaction.

lifecycle_stage: Describes the current stage of the transaction within its lifecycle, such as “pending,” “processed,” or “reconciled,” which could be linked to the DimLifecycleStage dimension table for detailed descriptions of each stage.

is_auto_matched: A boolean flag that indicates whether the transaction was matched automatically (auto-match), requiring no manual intervention. This is crucial for reporting on and analyzing the efficiency of the reconciliation process.
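
Taken together, the denormalized record can be written down as a PySpark schema. The sketch below lists the fields described above; the concrete data types are reasonable assumptions rather than the system's actual definitions.

# Sketch of the denormalized OLAP record as a PySpark schema (types are assumptions)
from pyspark.sql.types import (
    BooleanType, DateType, DecimalType, StringType, StructField, StructType
)

fact_transaction_audit_schema = StructType([
    StructField("transaction_id", StringType(), nullable=False),  # primary key
    StructField("bank_id", StringType(), nullable=False),         # FK -> DimBank
    StructField("amount", DecimalType(18, 2), nullable=False),
    StructField("transaction_date", DateType(), nullable=False),
    StructField("audit_id", StringType(), nullable=True),         # FK -> DimAuditTrail
    StructField("lifecycle_stage", StringType(), nullable=True),  # FK -> DimLifecycleStage
    StructField("is_auto_matched", BooleanType(), nullable=False),
])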

“The objective of the data warehouse is to provide a coherent picture of the business at a point in time.” — Bill Inmon, “Building the Data Warehouse”

Reporting Use Cases: Auto Match Reporting

Identify transactions reconciled automatically. This requires joining the FactTransactionAudit table with dimension tables to filter auto-matched transactions.

Access Path for Auto Match Transactions

SELECT
    dt.transaction_id,
    db.name AS bank_name,
    -- Additional fields
FROM
    FactTransactionAudit fta
JOIN
    DimTransactionDetails dt ON fta.FK_transaction_id = dt.transaction_id
    -- Additional joins
WHERE
    fta.is_auto_matched = TRUE;

Partitioning Strategy

Partitioning helps manage large datasets by dividing them into more manageable parts based on certain fields, often improving query performance by allowing systems to read only relevant partitions.

Suggested Partitioning Scheme:

For the ConsolidatedTransactions table in an OLAP setting like Hive:

By Date: Partitioning by transaction date (e.g., transaction_date) is a natural choice for financial data, allowing efficient queries over specific periods.

Structure: /year=YYYY/month=MM/day=DD/.

By Bank: If analyses often filter by specific banks, adding a secondary level of partitioning by bank_id can further optimize access patterns.

Structure: /year=YYYY/month=MM/bank_id=XYZ/.

CREATE TABLE ConsolidatedTransactions (
    transaction_id STRING,
    bank_name STRING,
    transaction_description STRING,
    transaction_amount FLOAT,
    lifecycle_description STRING,
    audit_action STRING,
    audit_timestamp TIMESTAMP
)
PARTITIONED BY (year STRING, month STRING, day STRING, bank_id STRING)
STORED AS PARQUET;

Optimal Partitioning Strategy

Processing billions of transactions per month with daily partitioning can generate a very large number of small files, while monthly partitioning can produce very large files that are inefficient to process. A weekly or bi-weekly partitioning scheme strikes a balance between the two, reducing the overall number of files while keeping file sizes manageable.

# Assuming df is your DataFrame loaded with the transaction data
from pyspark.sql.functions import col, weekofyear

df = df.withColumn("year_week", weekofyear(col("transaction_date")))

“In large databases, partitioning is critical for both performance and manageability.” — C.J. Date, “An Introduction to Database Systems”

Efficient Data Writes

Given the data volume, using repartition based on the partitioning scheme before writing can help distribute the data more evenly across the partitions, especially if the transactions are not uniformly distributed across the period.

Writing Data with Adaptive Repartitioning

Considering the volume, dynamic repartitioning based on the data characteristics of each write operation is necessary.

# Dynamically repartition based on the number of records
# Aim for partitions with around 50 million to 100 million records each

num_partitions = df.count() // 50000000
if num_partitions < 1:
    num_partitions = 1  # Ensure at least one partition

df.repartition(num_partitions, "year", "year_week", "bank_id") \
    .write \
    .mode("overwrite") \
    .partitionBy("year", "year_week", "bank_id") \
    .format("parquet") \
    .saveAsTable("ConsolidatedTransactions")

Dynamic Partitioning for Incremental Loads

For incremental loads, enabling dynamic partitioning in Spark is crucial to ensure that only the relevant partitions are updated without scanning or rewriting the entire dataset.

# Enable dynamic partitioning
spark.conf.set("hive.exec.dynamic.partition", "true")
spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")

# Write incremental data
df.write \
    .mode("append") \
    .partitionBy("year", "year_week", "bank_id") \
    .format("parquet") \
    .saveAsTable("ConsolidatedTransactions")

Other Considerations

Managing Small Files Issue

Small files can degrade performance in Hadoop ecosystems by overwhelming the NameNode with metadata operations and causing excessive overhead during processing.

Solutions:

Compaction Jobs: Regularly run compaction jobs to merge small files into larger ones within each partition. Spark can be used to read the small files in and coalesce them into a smaller number of larger files (a compaction sketch follows below).

Spark Repartition/Coalesce: When writing data out from Spark, use repartition to increase the number of partitions (and thus files) if needed, or coalesce to reduce them, depending on your use case.
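
A compaction pass of the first kind might look like the sketch below: read one partition's small Parquet files, rewrite them as a handful of larger files to a staging path, and then swap the staging output in for the original partition. The table location and target file count are assumptions.

# Illustrative compaction pass for one partition (paths and file counts are assumptions)
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-compaction").getOrCreate()

partition_path = "s3://brs-olap/consolidated_transactions/year=2024/month=04/day=15"  # assumed

# Read the partition's many small files and rewrite them as a few larger ones.
(spark.read.parquet(partition_path)
    .coalesce(4)  # target file count; tune toward ~128-512 MB per output file
    .write
    .mode("overwrite")
    .parquet(partition_path + "_compacted"))

# The compacted output would then replace the original partition directory
# (e.g. via an atomic rename or a metastore location update).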

Writing Data with Coalesce to Avoid Small Files:

# Assuming df is your DataFrame loaded with the data ready to be written

# Adjust the coalesce target below based on your specific needs
df.coalesce(10) \
    .write \
    .mode("overwrite") \
    .partitionBy("year", "month", "day", "bank_id") \
    .format("parquet") \
    .saveAsTable("ConsolidatedTransactions")

Using Repartition for Better File Distribution

# Adjust the partition column names and number of partitions as needed

df.repartition(10, "year", "month", "day", "bank_id") \
    .write \
    .mode("overwrite") \
    .partitionBy("year", "month", "day", "bank_id") \
    .format("parquet") \
    .saveAsTable("ConsolidatedTransactions")

Handling this data volume efficiently also requires monitoring and tuning Spark's execution parameters and memory settings; a few representative settings are sketched after the list below.

  • Cluster Capacity: Ensure the underlying hardware and cluster resources are scaled appropriately to handle the data volume and processing needs.
  • Archival Strategy: Implement a data archival or purging strategy for older data that is no longer actively queried to manage overall storage requirements.
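
The exact values depend on the cluster, but a tuning pass typically touches settings like the ones below. The numbers shown are placeholders for illustration, not recommendations.

# Representative Spark settings for a large reconciliation batch (values are placeholders)
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("consolidated-transactions-load")
    .config("spark.executor.memory", "16g")          # per-executor heap
    .config("spark.executor.cores", "4")
    .config("spark.sql.shuffle.partitions", "2000")  # shuffle parallelism for wide joins
    .config("spark.sql.adaptive.enabled", "true")    # let AQE coalesce skewed shuffle partitions
    .enableHiveSupport()
    .getOrCreate()
)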

To manage billions of transactions every month, it’s necessary to plan and optimize data storage and processing strategies. You can significantly improve the performance and scalability of your big data system by utilizing partitioning schemes that balance the file size with the number of files and by dynamically adjusting to the data volume during writes.
