Tag Archives: Amazon Redshift

Distribution Styles in Amazon Redshift: A Banking Reconciliation Use Case

When loading data into a table in Amazon Redshift, the rows are distributed across the node slices according to the table’s designated distribution style. Selecting the right distribution style (DISTSTYLE) is crucial for optimizing performance.

  • The primary goal is evenly distributing the data across the cluster, ensuring efficient parallel processing.
  • The secondary goal is to minimize the cost of data movement during query processing. Ideally, the data should be positioned where it’s needed before the query is executed, reducing unnecessary data shuffling.

Let’s bring this concept to life with an example from the banking industry, specifically focused on reconciliation processes — a common yet critical operation in financial institutions.

In a banking reconciliation system, transactions from various accounts and systems (e.g., internal bank records and external clearing houses) must be matched and validated to ensure accuracy. This process often involves large datasets with numerous transactions that need to be compared across different tables.

Example Table Structures

To demonstrate how different distribution styles can be applied, consider the following sample tables:

Transactions Table (Internal Bank Records)

CREATE TABLE internal_transactions (
transaction_id BIGINT,
account_number VARCHAR(20),
transaction_date DATE,
transaction_amount DECIMAL(10,2),
transaction_type VARCHAR(10)
)
DISTSTYLE KEY
DISTKEY (transaction_id);

The internal_transactions table is distributed using the KEY distribution style on the transaction_id column. This means that records with the same transaction_id will be stored together on the same node slice. This is particularly useful when these transactions are frequently joined with another table, such as external transactions, on the transaction_id.

Transactions Table (External Clearing House Records)

CREATE TABLE external_transactions (
transaction_id BIGINT,
clearinghouse_id VARCHAR(20),
transaction_date DATE,
transaction_amount DECIMAL(10,2),
status VARCHAR(10)
)
DISTSTYLE KEY
DISTKEY (transaction_id);

Similar to the internal transactions table, the external_transactions table is also distributed using the KEY distribution style on the transaction_id column. This ensures that when a join operation is performed between the internal and external transactions on the transaction_id, the data is already co-located, minimizing the need for data movement and speeding up the reconciliation process.

CREATE TABLE currency_exchange_rates (
currency_code VARCHAR(3),
exchange_rate DECIMAL(10,4),
effective_date DATE
)
DISTSTYLE ALL;

The currency_exchange_rates table uses the ALL distribution style. A full copy of this table is stored on the first slice of each node, which is ideal for small reference tables that are frequently joined with larger tables (such as transactions) but are not updated frequently. This eliminates the need for data movement during joins and improves query performance.

CREATE TABLE audit_logs (
log_id BIGINT IDENTITY(1,1),
transaction_id BIGINT,
action VARCHAR(100),
action_date TIMESTAMP,
user_id VARCHAR(50)
)
DISTSTYLE EVEN;

The audit_logs table uses the EVEN distribution style. Since this table may not participate in frequent joins and primarily serves as a log of actions performed during the reconciliation process, EVEN distribution ensures that the data is evenly spread across all node slices, balancing the load and allowing for efficient processing.

Applying the Distribution Styles in a Reconciliation Process

In this banking reconciliation scenario, let’s assume we need to reconcile internal and external transactions, convert amounts using the latest exchange rates, and log the reconciliation process.

  • The internal and external transactions will be joined on transaction_id. Since both tables use KEY distribution on transaction_id, the join operation will be efficient, as related data is already co-located.
  • Currency conversion will use the currency_exchange_rates table. With ALL distribution, a copy of this table is readily available on each node, ensuring fast lookups during the conversion process.
  • As actions are performed, logs are written to the audit_logs table, with EVEN distribution ensuring that logging operations are spread out evenly, preventing any single node from becoming a bottleneck.

This approach demonstrates how thoughtful selection of distribution styles can significantly enhance the performance and scalability of your data processing in Amazon Redshift, particularly in a complex, data-intensive scenario like banking reconciliation.

In Plain English 🚀

Thank you for being a part of the In Plain English community! Before you go: