
SQL for Data Engineering: Window Functions and Common Table Expressions (CTEs)

This article is inspired by a true story involving one of my close friends. He’s a backend developer, not a database expert, but during a recent interview, he was grilled heavily on SQL. After hearing about his experience, I realized something that might resonate with many of you: the days when SQL knowledge was limited to basic GROUP BY and JOIN operations are long gone. Today, the depth of SQL skills required—especially in data engineering roles—demands much more. If you’re preparing for interviews, you’ll need to master more advanced concepts, like window functions and Common Table Expressions (CTEs), to truly stand out.

“In theory, there is no difference between theory and practice. But in practice, there is.” — Yogi Berra

Why Focus on Window Functions in SQL?

As my friend’s story reminded me, SQL interviews aren’t just about basic querying anymore. Window functions, in particular, have become a critical part of data engineering interviews. Whether it’s ranking transactions, calculating rolling metrics, or handling complex partitioning, window functions allow you to perform operations that basic SQL can’t easily handle.

Let’s start by breaking down window functions and why they’re essential in real-world scenarios, especially when working with large-scale data.

What is a Window Function?

A window function is a SQL tool that allows you to perform calculations across a set of rows that are somehow related to the current row. Think of it as looking at a “window” of surrounding rows to compute things like cumulative sums, ranks, or moving averages.

The most common window functions are:

  • LAG: Fetches the value from the previous row.
  • LEAD: Fetches the value from the next row.
  • RANK: Assigns ranks to rows, skipping numbers when there are ties.
  • DENSE_RANK: Similar to RANK but doesn’t skip numbers.
  • ROW_NUMBER: Assigns unique sequential numbers to rows, without ties.
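
To make the difference between the ranking functions concrete, here is a minimal sketch over a hypothetical loans table (the table and its loan_amount column are assumed for illustration). Two loans tied on amount receive the same RANK and DENSE_RANK; RANK then skips the next number, DENSE_RANK does not, and ROW_NUMBER stays unique regardless of ties.

SELECT customer_id, loan_amount,
       RANK()       OVER (ORDER BY loan_amount DESC) AS rank_with_gaps,
       DENSE_RANK() OVER (ORDER BY loan_amount DESC) AS dense_rank_no_gaps,
       ROW_NUMBER() OVER (ORDER BY loan_amount DESC) AS row_position
FROM loans;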

These functions come in handy when dealing with tasks like analyzing customer transactions, calculating running totals, or ranking entries in financial datasets. Now, let’s move into a practical banking example that you might encounter in an interview.

Example: Identifying Top Three Transactions by Amount for Each Customer

Imagine you’re asked in an interview: “Find the top three largest transactions for each customer in the past year.” Right away, you should recognize that a simple GROUP BY or JOIN won’t work here—you’ll need a window function to rank transactions by amount for each customer.

Here’s how to approach it using the ROW_NUMBER function:

WITH customer_transactions AS (
    SELECT customer_id, transaction_id, transaction_date, amount,
           ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY amount DESC) AS transaction_rank
    FROM transactions
    WHERE transaction_date >= DATEADD(year, -1, GETDATE())
)
SELECT customer_id, transaction_id, transaction_date, amount
FROM customer_transactions
WHERE transaction_rank <= 3;

In this query:

  • The PARTITION BY clause divides the data into groups by customer.
  • The ORDER BY clause ranks the transactions based on the amount, from highest to lowest.
  • The ROW_NUMBER() function assigns a unique rank to each transaction within each customer's partition, letting you keep only the top three per customer.

This example goes beyond basic SQL skills, showcasing how window functions enable you to solve more complex real-world problems — something you’ll encounter frequently in interviews and on the job.

Keywords That Hint at Using Window Functions

In a SQL interview, look out for keywords that signal the need for window functions:

  • Rolling (e.g., rolling sum or average of balances)
  • Rank (e.g., top transactions, highest loan amounts)
  • Consecutive (e.g., consecutive late payments)
  • De-duplicate (e.g., identifying unique customer transactions)

For example, a question like “Give me the top five deposits per account over the past six months” is a clear signal that a window function such as RANK or ROW_NUMBER is required.
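
Here is one way such a question could be answered, following the same pattern as the top-three-transactions example above (the deposits table and its columns are assumed for illustration):

WITH ranked_deposits AS (
    SELECT account_id, deposit_id, deposit_date, amount,
           ROW_NUMBER() OVER (PARTITION BY account_id ORDER BY amount DESC) AS deposit_rank
    FROM deposits
    WHERE deposit_date >= DATEADD(month, -6, GETDATE())
)
SELECT account_id, deposit_id, deposit_date, amount
FROM ranked_deposits
WHERE deposit_rank <= 5;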

Understanding the Components of a Window Function

Each window function consists of three main components:

  1. Function: This could be something like SUM(), RANK(), or LAG().
  2. OVER() Clause: Defines the window, i.e., the rows across which the function is applied. Without this, it’s just a regular aggregate function. This is where PARTITION BY and ORDER BY come into play.
  3. Optional ROWS Clause: Rarely used but powerful when you need to calculate things like rolling averages or sums over a defined number of rows.

Let’s look at a practical example for a bank that wants to calculate the rolling 30-day balance for each customer’s account:

SELECT customer_id, transaction_date,
       SUM(amount) OVER (
           PARTITION BY customer_id
           ORDER BY transaction_date
           ROWS BETWEEN 29 PRECEDING AND CURRENT ROW
       ) AS rolling_balance
FROM transactions;

  • ROWS BETWEEN 29 PRECEDING AND CURRENT ROW frames the window as the current row plus the 29 preceding rows for that customer. Strictly speaking this is a 30-row window, which approximates a 30-day window only when there is one transaction per day (a calendar-based variant is sketched below).
  • The result is a rolling sum of transaction amounts over that window for each customer, a common requirement in banking data analysis.
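
If your engine supports RANGE frames over date intervals (PostgreSQL does, for example; support and syntax vary by engine), the frame can be defined over the calendar rather than a row count. A minimal sketch:

SELECT customer_id, transaction_date,
       SUM(amount) OVER (
           PARTITION BY customer_id
           ORDER BY transaction_date
           RANGE BETWEEN INTERVAL '29 days' PRECEDING AND CURRENT ROW
       ) AS rolling_30_day_balance
FROM transactions;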

Common Table Expressions (CTEs): Your Best Friend for Complex Queries

CTEs are another key tool in advanced SQL, especially for interviews. A CTE allows you to define a temporary result set that can be referenced within the main query, making your code more readable and maintainable.

Syntax of a CTE:

WITH cte_name AS (
    SELECT column1, column2
    FROM table
    WHERE condition
)
SELECT *
FROM cte_name
WHERE another_condition;

Let’s extend our banking example. Suppose you’re asked to identify customers who have made consecutive late payments. Instead of cluttering your query with subqueries, you can simplify it using a CTE:

WITH customer_payments AS (
    SELECT customer_id, payment_date, payment_status,
           LAG(payment_status) OVER (PARTITION BY customer_id ORDER BY payment_date) AS previous_payment_status
    FROM payments
)
SELECT customer_id, COUNT(*) AS consecutive_late_payments
FROM customer_payments
WHERE payment_status = 'Late'
  AND previous_payment_status = 'Late'
GROUP BY customer_id;

Here, the LAG() function pulls in the status of each customer’s previous payment. Filtering for rows where both the current and the previous payment are marked “Late” surfaces customers with back-to-back late payments, a typical use case in risk management for banks. Note that the late-status filter belongs in the outer query: filtering inside the CTE would make LAG() skip over on-time payments and treat any two late payments as consecutive.

When to Use CTEs vs. Subqueries vs. Temporary Tables

A common question that arises is when to use CTEs over subqueries or temporary tables. Here’s a quick rule of thumb:

  • CTEs: Ideal for improving readability and maintainability, especially in big data environments like Spark, Snowflake, or Trino.
  • Subqueries: Useful when you need a single scalar value, such as the total sum of loan amounts or average transaction size.
  • Temporary Tables: Best when you need to reuse intermediate results multiple times across queries, often improving performance in complex pipelines.

For example, if you’re working with millions of financial transactions and need to run multiple calculations, a temporary table could save significant processing time by storing intermediate results that are reused in other queries.
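
As a minimal sketch of the temporary-table approach in the same SQL Server-style dialect used in the earlier examples (table and column names are illustrative), the intermediate result is materialized once and then reused by several follow-up queries:

-- Materialize per-customer totals once
SELECT customer_id, SUM(amount) AS total_spend
INTO #customer_totals
FROM transactions
GROUP BY customer_id;

-- Reuse the stored result in multiple queries without recomputing it
SELECT customer_id, total_spend
FROM #customer_totals
WHERE total_spend > 100000;

SELECT AVG(total_spend) AS avg_spend
FROM #customer_totals;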


Mastering window functions and CTEs is your secret weapon in SQL interviews. These tools allow you to handle complex tasks like ranking transactions, calculating rolling balances, and identifying consecutive events — skills that will set you apart from other candidates.

By focusing on these advanced SQL techniques and understanding when to apply them, you’ll not only excel in interviews but also be prepared for the challenges you’ll face in real-world data analysis.

Apache Spark Optimizations: Shuffle Join Vs. Broadcast Joins

Apache Spark is an analytics engine that processes large-scale data in distributed computing environments. It offers various join operations that are essential for handling big data efficiently. In this article, we will take an in-depth look at Spark joins. We will compare normal (shuffle) joins with broadcast joins and explore other available join types.

Understanding Joins in Apache Spark

Joins in Apache Spark are operations that combine rows from two or more datasets based on a common key. The way Spark processes these joins, especially in a distributed setting, significantly impacts the performance and scalability of data processing tasks.

Normal Join (Shuffle Join)

Operation

  • Shuffling: In a shuffle join, both datasets are repartitioned (shuffled) across the cluster based on the join keys. This moves data between nodes and can be network-intensive.

Use Case

  • Large Datasets: Best for large datasets where both sides of the join have substantial data. Typically used when no dataset is small enough to fit in the memory of a single worker node.

Performance Considerations

  • Speed: Generally slower, due to the extensive data shuffling involved.
  • Bottlenecks: Network I/O and disk writes during shuffling can become bottlenecks, especially for large datasets.

Scalability

  • Large Data Handling: Scales well with large datasets, but stays efficient only when optimized (e.g., through appropriate partitioning).

Broadcast Join

Operation

  • Broadcasting: In a broadcast join, the smaller dataset is sent (broadcast) to each node in the cluster, avoiding shuffling the larger dataset.

Use Case

  • Uneven Data Size: Ideal when one dataset is significantly smaller than the other. The smaller dataset should be small enough to fit in the memory of each worker node.

Performance Considerations

  • Speed: Typically faster than normal joins as it minimizes data shuffling.
  • Reduced Overhead: Reduces network I/O and disk write overhead.

Scalability

  • Small Dataset Joins: Works well when one side of the join is small, but is not suitable for joining two large datasets.

Choosing Between Normal and Broadcast Joins

The choice between a normal join and a broadcast join in Apache Spark largely depends on the size of the datasets involved and the resources of your Spark cluster. Here are key factors to consider:

  1. Data Size and Distribution: A broadcast join is usually more efficient if one dataset is small. For two large datasets, a normal join is more suitable.
  2. Cluster Resources: Consider the memory and network resources of your Spark cluster.
  3. Tuning and Optimization: Spark’s ability to optimize queries (e.g., Catalyst optimizer) can sometimes automatically choose the best join strategy.

Key configurations that affect broadcast joins in Apache Spark:

  • spark.sql.autoBroadcastJoinThreshold: Sets the maximum size of a table that can be broadcast. Lowering the threshold can disable broadcast joins for larger tables, while increasing it can enable broadcast joins for bigger tables.
  • spark.sql.broadcastTimeout: Determines the timeout for the broadcast operation. Spark may fall back to a shuffle join if the broadcast takes longer than this.
  • spark.driver.memory: Allocates memory to the driver. Since the driver coordinates broadcasting, insufficient memory can hinder the broadcast join operation.
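
As a minimal sketch of how these settings might be adjusted when building a session (the threshold and timeout values below are purely illustrative, not recommendations):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("BroadcastJoinTuning")
    # Allow tables up to ~50 MB to be broadcast (the default threshold is 10 MB); -1 disables broadcast joins
    .config("spark.sql.autoBroadcastJoinThreshold", 50 * 1024 * 1024)
    # Give the broadcast up to 600 seconds before timing out (the default is 300 seconds)
    .config("spark.sql.broadcastTimeout", 600)
    .getOrCreate()
)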


Other Types of Joins in Spark

Apart from normal and broadcast joins, Spark supports several other join types:

  1. Inner Join: Combines rows from both datasets where the join condition is met.
  2. Outer Joins: Includes left outer, right outer, and full outer joins, which retain rows from one or both datasets even when no matching join key is found.
  3. Cross Join (Cartesian Join): Produces a Cartesian product of the rows from both datasets. Every row of the first dataset is joined with every row of the second dataset, often resulting in many rows.
  4. Left Semi Join: This join type returns only the rows from the left dataset for which there is a corresponding row in the right dataset, but the columns from the right dataset are not included in the output.
  5. Left Anti Join: Opposite to the Left Semi Join, this join type returns rows from the left dataset that do not have a corresponding row in the right dataset (a short PySpark sketch of both semi and anti joins follows this list).
  6. Self Join: This is not a different type of join per se, but rather a technique where a dataset is joined with itself. This can be useful for hierarchical or sequential data analysis.
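
To illustrate points 4 and 5, here is a minimal PySpark sketch of semi and anti joins (the DataFrames are toy placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SemiAntiJoinExample").getOrCreate()

accounts = spark.createDataFrame([(1, "A"), (2, "B"), (3, "C")], ["id", "value"])
flagged = spark.createDataFrame([(1, "X"), (2, "Y")], ["id", "value2"])

# Left semi join: rows from `accounts` that have a match in `flagged`; no columns from `flagged` are returned
accounts.join(flagged, "id", "left_semi").show()

# Left anti join: rows from `accounts` that have no match in `flagged`
accounts.join(flagged, "id", "left_anti").show()

spark.stop()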

Considerations for Efficient Use of Joins in Spark

When implementing joins in Spark, several considerations can help optimize performance:

  1. Data Skewness: Imbalanced data distribution can lead to skewed processing across the cluster. Addressing data skewness is crucial for maintaining performance.
  2. Memory Management: Ensuring that the broadcasted dataset fits into memory is crucial for broadcast joins. Out-of-memory errors can significantly impact performance.
  3. Join Conditions: Optimal use of join keys and conditions can reduce the amount of data shuffled across the network.
  4. Partitioning and Clustering: Proper partitioning and clustering of data can enhance join performance by minimizing data movement (see the repartitioning sketch after this list).
  5. Use of DataFrames and Datasets: Utilizing DataFrames and Datasets API for joins can leverage Spark’s Catalyst Optimizer for better execution plans.
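
As one small illustration of points 1 and 4, both sides of a shuffle join can be explicitly repartitioned on the join key so that matching keys land in the same partitions (a sketch with placeholder data and an arbitrary partition count):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RepartitionBeforeJoin").getOrCreate()

# Toy stand-ins for two large tables keyed by customer_id
orders = spark.createDataFrame([(1, 100.0), (1, 250.0), (2, 75.0)], ["customer_id", "amount"])
customers = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["customer_id", "name"])

# Repartition both sides on the join key before joining
orders_repart = orders.repartition(200, "customer_id")
customers_repart = customers.repartition(200, "customer_id")

joined = orders_repart.join(customers_repart, "customer_id")
joined.show()

spark.stop()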

PySpark code for the broadcast join:

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

# Initialize Spark Session
spark = SparkSession.builder \
    .appName("BroadcastJoinExample") \
    .getOrCreate()

# Sample DataFrames
df_large = spark.createDataFrame([(1, "A"), (2, "B"), (3, "C")], ["id", "value"])
df_small = spark.createDataFrame([(1, "X"), (2, "Y")], ["id", "value2"])

# Perform Broadcast Join
joined_df = df_large.join(broadcast(df_small), "id")

# Show the result
joined_df.show()

# Explain Plan
joined_df.explain()

# Stop the Spark Session
spark.stop()
