Tag Archives: Sql

SQL for Data Engineering: Window Functions and Common Table Expressions (CTEs)

Why Focus on Window Functions in SQL?

As my friend’s story reminded me, SQL interviews aren’t just about basic querying anymore. Window functions, in particular, have become a critical part of data engineering interviews. Whether it’s ranking transactions, calculating rolling metrics, or handling complex partitioning, window functions allow you to perform operations that basic SQL can’t easily handle.

Let’s start by breaking down window functions and why they’re essential in real-world scenarios, especially when working with large-scale data.

What is a Window Function?

A window function is a SQL tool that allows you to perform calculations across a set of rows that are somehow related to the current row. Think of it as looking at a “window” of surrounding rows to compute things like cumulative sums, ranks, or moving averages.

The most common window functions are:

LAG: Fetches the value from the previous row.
LEAD: Fetches the value from the next row.
RANK: Assigns ranks to rows, skipping numbers when there are ties.
DENSE_RANK: Similar to RANK but doesn’t skip numbers.
ROW_NUMBER: Assigns unique sequential numbers to rows, without ties.

These functions come in handy when dealing with tasks like analyzing customer transactions, calculating running totals, or ranking entries in financial datasets. Now, let’s move into a practical banking example that you might encounter in an interview.

Example: Identifying Top Three Transactions by Amount for Each Customer

Imagine you’re asked in an interview: “Find the top three largest transactions for each customer in the past year.” Right away, you should recognize that a simple GROUP BY or JOIN won’t work here—you’ll need a window function to rank transactions by amount for each customer.

Here’s how to approach it using the ROW_NUMBER function:

WITH customer_transactions AS (
    SELECT customer_id, transaction_id, transaction_date, amount,
           ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY amount DESC) AS transaction_rank
    FROM transactions
    WHERE transaction_date >= DATEADD(year, -1, GETDATE())
)
SELECT customer_id, transaction_id, transaction_date, amount
FROM customer_transactions
WHERE transaction_rank <= 3;

In this query:

The PARTITION BY clause divides the data into groups by customer.
The ORDER BY clause ranks the transactions based on the amount, from highest to lowest.
The ROW_NUMBER() function assigns a unique rank to each transaction for each customer, allowing you to filter out the top three for each.

This example goes beyond basic SQL skills, showcasing how window functions enable you to solve more complex real-world problems — something you’ll encounter frequently in interviews and on the job.

Keywords That Hint at Using Window Functions

In a SQL interview, look out for keywords that signal the need for window functions:

Rolling (e.g., rolling sum or average of balances)
Rank (e.g., top transactions, highest loan amounts)
Consecutive (e.g., consecutive late payments)
De-duplicate (e.g., identifying unique customer transactions)

For example, a question like “Give me the top five deposits per account over the past six months” is a clear indication that a window function — such as RANK or ROW_NUMBER—is required.

Understanding the Components of a Window Function

Each window function consists of three main components:

Function: This could be something like SUM(), RANK(), or LAG().
OVER() Clause: Defines the window, i.e., the rows across which the function is applied. Without this, it’s just a regular aggregate function. This is where PARTITION BY and ORDER BY come into play.
Optional ROWS Clause: Rarely used but powerful when you need to calculate things like rolling averages or sums over a defined number of rows.

Let’s look at a practical example for a bank that wants to calculate the rolling 30-day balance for each customer’s account:

SELECT customer_id, transaction_date, 
       SUM(amount) OVER (PARTITION BY customer_id ORDER BY transaction_date 
                         ROWS BETWEEN 29 PRECEDING AND CURRENT ROW) AS rolling_balance
FROM transactions;

ROWS BETWEEN 29 PRECEDING AND CURRENT ROW defines a 30-day window for the balance calculation.
The result is a rolling sum of account balances over the last 30 days for each customer, a common requirement in banking data analysis.

Common Table Expressions (CTEs): Your Best Friend for Complex Queries

CTEs are another key tool in advanced SQL, especially for interviews. A CTE allows you to define a temporary result set that can be referenced within the main query, making your code more readable and maintainable.

Syntax of a CTE:

WITH cte_name AS (
    SELECT column1, column2
    FROM table
    WHERE condition
)
SELECT *
FROM cte_name
WHERE another_condition;

Let’s extend our banking example. Suppose you’re asked to identify customers who have made consecutive late payments. Instead of cluttering your query with subqueries, you can simplify it using a CTE:

WITH customer_late_payments AS (
    SELECT customer_id, payment_date, 
           LAG(payment_status) OVER (PARTITION BY customer_id ORDER BY payment_date) AS previous_payment_status
    FROM payments
    WHERE payment_status = 'Late'
)
SELECT customer_id, COUNT(*) AS consecutive_late_payments
FROM customer_late_payments
WHERE previous_payment_status = 'Late'
GROUP BY customer_id;

In this case, the LAG() function helps identify whether the previous payment was also marked as “Late.” This query identifies customers with consecutive late payments, a typical use case in risk management for banks.

When to Use CTEs vs. Subqueries vs. Temporary Tables

A common question that arises is when to use CTEs over subqueries or temporary tables. Here’s a quick rule of thumb:

CTEs: Ideal for improving readability and maintainability, especially in big data environments like Spark, Snowflake, or Trino.
Subqueries: Useful when you need a single scalar value, such as the total sum of loan amounts or average transaction size.
Temporary Tables: Best when you need to reuse intermediate results multiple times across queries, often improving performance in complex pipelines.

For example, if you’re working with millions of financial transactions and need to run multiple calculations, a temporary table could save significant processing time by storing intermediate results that are reused in other queries.

Mastering window functions and CTEs is your secret weapon in SQL interviews. These tools allow you to handle complex tasks like ranking transactions, calculating rolling balances, and identifying consecutive events — skills that will set you apart from other candidates.

By focusing on these advanced SQL techniques and understanding when to apply them, you’ll not only excel in interviews but also be prepared for the challenges you’ll face in real-world data analysis.

Stackademic 🎓

Thank you for reading until the end. Before you go:

Please consider clapping and following the writer! 👏
Follow us X | LinkedIn | YouTube | Discord
Visit our other platforms: In Plain English | CoFeed | Differ
More content at Stackademic.com

Couchbase 101: Introduction

Couchbase Architecture & Features

Couchbase is built on a memory-first architecture which prioritizes in-memory processing to achieve high performance. Whenever a new item is saved, it is initially stored in memory, and associated data with Couchbase buckets is maintained persistently on disk.

Couchbase Server has a memory-first design for fast data access. Its active memory defragmenter optimizes performance. It supports a flexible JSON data model and direct in-memory access to its key-value engine for modern performance demands.

Consistency Models and N1QL Query Language

Let’s explore how Couchbase elegantly navigates the classic trade-offs in distributed systems, balancing Consistency, Availability, and Partition Tolerance, also known as the CAP theorem.

Couchbase offers a strategic shift from CP to AP, a choice that depends on your deployment topology and the desired system behaviour.

For a Single Cluster setup, Couchbase operates as a CP system, emphasizing strong consistency and partition tolerance, ensuring that your data remains accurate and synchronized across your cluster, which is ideal for scenarios where every transaction counts.

On the other hand, in a multi-cluster setup with cross-datacenter replication, abbreviated as XDCR, Couchbase adopts an AP approach, prioritizing availability over immediate consistency. This model is perfect for applications where uptime is critical, and data can eventually synchronize across clusters.

Highlighting its advanced capabilities, the N1QL Query Language in Couchbase now supports Distributed ACID Transactions.

This means you can perform complex transactions across your distributed Database with the assurance of atomicity, consistency, isolation, and durability — the cornerstone of reliable database management.

With these features, Couchbase ensures that your data is distributed, resilient and intelligently managed to meet the various demands of modern applications.

Concept 1 — Your Choice of Services

As we unfold the pages of Couchbase’s architecture, Concept 1 highlights ‘Your Choice of Services’. This high-level overview showcases the modular and resilient design of Couchbase, which empowers you to tailor your database architecture to your application’s specific needs.

Starting with Cluster Management, Couchbase offers a distributed architecture with no single point of failure, ensuring your system’s high availability. Automatic sharding through vBuckets ensures load balancing and scalability, while Cross Data Center Replication(XDCR) offers geographical redundancy.

The Data Service is the backbone, providing robust key-value storage with in-cluster and cross-data centre replication capabilities, all accelerated by a built-in cache for high performance.

Moving on to the Query Service, here we have the powerful N1QL, or SQL for JSON, for planning and executing queries, supporting JOINS, aggregations, and subqueries, giving you the flexibility of SQL with the power of JSON.

The Index Service seamlessly manages N1QL index creation, update, replication, and maintenance, while the Search Service provides comprehensive full-text indexing and search support.

Analytics Service offers isolated, distributed queries for long-running analytical operations without affecting your operational database performance.

Finally, the Eventing Service introduces event-driven data management, allowing you to respond to data mutations quickly.

Together, these services form a cohesive framework that stores and manages your data and integrates with your application logic for a seamless experience.

Each node or server in Couchbase is identical and capable of housing data in any configuration necessary to meet your application’s demands. As we see here, the four nodes are interconnected to form what is traditionally known as a cluster.

What’s unique about Couchbase is its flexibility in configuring these nodes. Depending on your changing capacity requirements, you can assign more or fewer resources to specific services. This adaptability is illustrated in our diagram, showing a variety of possible configurations.

In the first configuration, all services, from Data to Analytics, are distributed evenly across all nodes, ensuring a balanced workload and optimal utilization of resources.

In the second configuration, you can see that we’ve scaled up the Data Service across nodes to accommodate a heavier data load, demonstrating Couchbase’s agility in resource allocation.

The third configuration takes a specialized approach, with each node dedicated to a specific service, optimizing for intense workloads and dedicated tasks.

This level of customization ensures that as your application grows and evolves, your Couchbase cluster can adapt seamlessly, providing consistent performance and reliability.

Couchbase’s design philosophy is to provide you with the tools to build a database cluster that’s as dynamic as your business needs, without compromising on performance, availability, or scalability.

What is a Bucket?

In Couchbase, keys and documents are stored in a Bucket.

A Couchbase Bucket* stores data persistently, as well as in memory.
Buckets allow data to be automatically replicated for high availability and dynamically scaled across multiple databases by means of Cross Datacenter Replication (XDCR)

Bucket Storage

A Couchbase Database consists of one or more instances of Couchbase, each running a set of services, including the Data Service.

Each Bucket is sharded equally and automatically among the Servers, also known as Nodes, in the Database.

Bucket Composition

Within each Bucket are 1024 vBuckets, also known as shards, spread out equally and automatically only on Data nodes. Couchbase refers to this automatic distribution as auto-sharding.

VBuckets: Stores a subset of the total data set. Allow for horizontal scalability.

Concept 2 — Automatic Sharding

Concept 2 is centred on ‘Automatic Sharding’ — a pivotal feature of Couchbase that addresses the challenges of managing a growing dataset. As the volume of data increases, the need for efficient management becomes crucial. Couchbase rises to the occasion by automatically partitioning data across multiple nodes within the cluster, a technique known as sharding. This approach guarantees a balanced distribution of data, which is instrumental in enhancing both performance and scalability.

The mechanism behind this is the implementation of vBuckets or virtual buckets. These vBuckets are designed to distribute data evenly across all nodes, thus empowering horizontal scaling and bolstering fault tolerance and recovery. For developers, this means simplicity and ease of use, as the complexity of sharding is abstracted away, allowing them to concentrate on what they do best — building outstanding applications, assured that the data layer will scale as needed without any extra intervention.

Concept 3 — Database Change Protocol

Core Function: DCP (Database Change Protocol) is a key replication protocol in Couchbase Server, connecting nodes and clusters across different data centres.

DCP (Database Change Protocol)

Key Features: Includes ordered mutations, optimized restarts post-failures, efficient, consistent snapshot production, and eager changes streaming.

Concept 3 introduces the ‘Database Change Protocol’ at the heart of Couchbase’s real-time replication and synchronization capabilities. This protocol ensures that changes made to the Database are captured and communicated efficiently across different system parts.

Whether for cache invalidation, index maintenance, or cross-data centre replication, the Database Change Protocol ensures that all components of your Couchbase deployment stay in sync. This mechanism is crucial for maintaining data consistency, especially in distributed environments, and it supports Couchbase’s high availability and resilience promises to your applications.

Stackademic

Thank you for reading until the end. Before you go:

Please consider clapping and following the writer! 👏
Follow us on Twitter(X), LinkedIn, and YouTube.
Visit Stackademic.com to find out more about how we are democratizing free programming education around the world.

M	T	W	T	F	S	S
						1
2	3	4	5	6	7	8
9	10	11	12	13	14	15
16	17	18	19	20	21	22
23	24	25	26	27	28

Shanoj

Learn.Share.Grow