
OLAP: The Continuum from Transactions (OLTP)

This article is my answer to many colleagues who often ask me, “If you were designing a data solution, what would you incorporate?” In real life, we rarely get the privilege of designing a solution from scratch, end to end. More often than not, we work with systems already designed and implemented by someone else. That, in itself, is a trade-off we all have to accept at some point. But nothing stops us from learning, reflecting, and studying from those systems — especially learning from the mistakes that might have been made.

This article is about exactly that: learning from what’s been done and thinking about what could have been improved. One key thing I’ve noticed in many data environments is the absence of a master data layer. Data analysts or data scientists often query directly from raw, unstructured snapshot data instead of working from properly curated master data. This leads to inefficient analysis and unreliable insights.

Let’s explore how a properly designed data flow can address these challenges using the OLTP-to-OLAP continuum.

1. Production Database Snapshots

In any data-driven system, the starting point is usually OLTP systems. These systems handle real-time transactions — whether it’s a customer placing an order, transferring money, or updating account details. Every operational activity generates data, and these systems are optimized to record that data at high speed.

However, while OLTP systems are excellent at handling transactions, they are not designed for complex data analysis. This is where production database snapshots come into play. These snapshots capture the operational data at regular intervals, preserving a picture of the system at a given moment. The key challenge is what happens next: if you query directly from this raw snapshot, you’re likely to run into performance and consistency issues.

In an ideal scenario, we should move this snapshot data into a more structured format, setting the stage for accurate and meaningful analysis.

2. Master Data

This is where many data environments struggle. In the absence of a master data layer, analysts are forced to work with raw, inconsistent data. Master data provides a single source of truth by cleaning, organizing, and harmonizing disparate data sources.

For instance, imagine trying to analyze customer data across multiple products without a master data layer. Without a unified view of the customer, you end up with fragmented and sometimes contradictory data. This makes it harder to draw meaningful insights. The master data layer addresses this by creating consistent, well-organized records of key entities like customers, products, transactions, and more.
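To make that concrete, here is a minimal sketch of how a unified customer record could be assembled in SQL. The table names (crm_customers, billing_customers) and the survivorship rule (the most recently updated record wins) are purely hypothetical:

WITH all_sources AS (
-- Union customer records from two hypothetical source systems
SELECT customer_id, full_name, email, updated_at, 'crm' AS source_system
FROM crm_customers
UNION ALL
SELECT customer_id, full_name, email, updated_at, 'billing' AS source_system
FROM billing_customers
),
golden AS (
-- Keep the most recently updated record per customer as the "golden" record
SELECT customer_id, full_name, email, source_system,
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY updated_at DESC) AS rn
FROM all_sources
)
SELECT customer_id, full_name, email, source_system
FROM golden
WHERE rn = 1;

Real master data management involves far more than this (matching rules, stewardship, slowly changing attributes), but even a simple golden-record query like this removes much of the ambiguity analysts would otherwise have to resolve on their own.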

If I were designing a data solution, ensuring a solid master data layer would be one of my top priorities. This foundational layer improves the quality of data and ensures that all subsequent analyses are based on accurate, reliable information.

3. OLAP Cubes

Once the master data is set, the next step is processing it through OLAP cubes. OLAP systems are designed to handle complex, multidimensional queries that allow for deep analysis. For example, an OLAP cube might allow a company to analyze sales data by region, product category, and time period simultaneously.
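As a rough sketch of what that kind of query looks like in SQL, assuming a hypothetical sales table that already carries region, product_category, and sale_month columns (GROUP BY CUBE is standard SQL and supported by most warehouse engines):

SELECT region, product_category, sale_month,
SUM(sale_amount) AS total_sales
FROM sales
GROUP BY CUBE (region, product_category, sale_month);

The CUBE produces subtotals for every combination of the three dimensions, which is essentially what an OLAP cube pre-computes so that slicing and dicing feels instantaneous.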

The power of OLAP lies in its ability to aggregate data and provide quick access to insights across various dimensions. This is especially important in industries like finance, retail, or logistics, where understanding patterns and trends across different variables is critical for decision-making.

In many environments I’ve observed, OLAP systems are either underutilized or not implemented at all. This results in slow, inefficient analysis that can’t keep up with the speed of modern business. In contrast, using OLAP cubes to handle the heavy lifting of data aggregation ensures that insights can be generated faster and more efficiently.

4. Metrics

At the end of the continuum, we reach metrics — the ultimate output of the entire data pipeline. Whether it’s tracking sales performance, customer behavior, or operational efficiency, these metrics provide the actionable insights that drive business decisions.

However, the quality of these metrics depends entirely on the previous steps. Without proper data snapshots, master data, or OLAP cubes, the metrics generated will be unreliable. If each stage of the continuum is carefully managed, the metrics produced will be accurate and insightful, providing the information decision-makers need to act with confidence.


The key takeaway here is that in any data solution, from the most basic to the most complex, structure matters. A well-designed pipeline ensures that data flows smoothly from OLTP systems to OLAP analysis, ultimately providing the metrics needed to inform business strategy.


SQL for Data Engineering: Window Functions and Common Table Expressions (CTEs)

This article is inspired by a true story involving one of my close friends. He’s a backend developer, not a database expert, but during a recent interview, he was grilled heavily on SQL. After hearing about his experience, I realized something that might resonate with many of you: the days when SQL knowledge was limited to basic GROUP BY and JOIN operations are long gone. Today, the depth of SQL skills required—especially in data engineering roles—demands much more. If you’re preparing for interviews, you’ll need to master more advanced concepts, like window functions and Common Table Expressions (CTEs), to truly stand out.

“In theory, there is no difference between theory and practice. But in practice, there is.” — Yogi Berra

Why Focus on Window Functions in SQL?

As my friend’s story reminded me, SQL interviews aren’t just about basic querying anymore. Window functions, in particular, have become a critical part of data engineering interviews. Whether it’s ranking transactions, calculating rolling metrics, or handling complex partitioning, window functions allow you to perform operations that basic SQL can’t easily handle.

Let’s start by breaking down window functions and why they’re essential in real-world scenarios, especially when working with large-scale data.

What is a Window Function?

A window function is a SQL tool that allows you to perform calculations across a set of rows that are somehow related to the current row. Think of it as looking at a “window” of surrounding rows to compute things like cumulative sums, ranks, or moving averages.

The most common window functions are:

  • LAG: Fetches the value from the previous row.
  • LEAD: Fetches the value from the next row.
  • RANK: Assigns ranks to rows, skipping numbers when there are ties.
  • DENSE_RANK: Similar to RANK but doesn’t skip numbers.
  • ROW_NUMBER: Assigns unique sequential numbers to rows, without ties.
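To see how the three ranking functions differ when values tie, here is a quick side-by-side sketch, using the same transactions table as the examples that follow:

SELECT customer_id, amount,
RANK() OVER (PARTITION BY customer_id ORDER BY amount DESC) AS rnk,
DENSE_RANK() OVER (PARTITION BY customer_id ORDER BY amount DESC) AS dense_rnk,
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY amount DESC) AS row_num
FROM transactions;

If a customer has two transactions tied for the largest amount, RANK returns 1, 1, 3, DENSE_RANK returns 1, 1, 2, and ROW_NUMBER returns 1, 2, 3, breaking the tie arbitrarily unless you add more columns to the ORDER BY.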

These functions come in handy when dealing with tasks like analyzing customer transactions, calculating running totals, or ranking entries in financial datasets. Now, let’s move into a practical banking example that you might encounter in an interview.

Example: Identifying Top Three Transactions by Amount for Each Customer

Imagine you’re asked in an interview: “Find the top three largest transactions for each customer in the past year.” Right away, you should recognize that a simple GROUP BY or JOIN won’t work here—you’ll need a window function to rank transactions by amount for each customer.

Here’s how to approach it using the ROW_NUMBER function:

WITH customer_transactions AS (
SELECT customer_id, transaction_id, transaction_date, amount,
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY amount DESC) AS transaction_rank
FROM transactions
WHERE transaction_date >= DATEADD(year, -1, GETDATE())
)
SELECT customer_id, transaction_id, transaction_date, amount
FROM customer_transactions
WHERE transaction_rank <= 3;

In this query:

  • The PARTITION BY clause divides the data into groups by customer.
  • The ORDER BY clause ranks the transactions based on the amount, from highest to lowest.
  • The ROW_NUMBER() function assigns a unique rank to each transaction for each customer, allowing you to filter out the top three for each.

This example goes beyond basic SQL skills, showcasing how window functions enable you to solve more complex real-world problems — something you’ll encounter frequently in interviews and on the job.

Keywords That Hint at Using Window Functions

In a SQL interview, look out for keywords that signal the need for window functions:

  • Rolling (e.g., rolling sum or average of balances)
  • Rank (e.g., top transactions, highest loan amounts)
  • Consecutive (e.g., consecutive late payments)
  • De-duplicate (e.g., identifying unique customer transactions)

For example, a question like “Give me the top five deposits per account over the past six months” is a clear indication that a window function — such as RANK or ROW_NUMBER—is required.
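Rolling, rank, and consecutive all get worked examples in this article; de-duplication deserves a quick sketch of its own. A common pattern keeps one row per logical key using ROW_NUMBER (the raw_transactions table and the key columns here are hypothetical):

WITH ranked_rows AS (
SELECT customer_id, transaction_id, amount, transaction_date,
ROW_NUMBER() OVER (
PARTITION BY customer_id, transaction_id -- the key that defines a duplicate
ORDER BY transaction_date DESC -- keep the most recent copy
) AS rn
FROM raw_transactions
)
SELECT customer_id, transaction_id, amount, transaction_date
FROM ranked_rows
WHERE rn = 1;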

Understanding the Components of a Window Function

Each window function consists of three main components:

  1. Function: This could be something like SUM(), RANK(), or LAG().
  2. OVER() Clause: Defines the window, i.e., the rows across which the function is applied. Without this, it’s just a regular aggregate function. This is where PARTITION BY and ORDER BY come into play.
  3. Optional ROWS Clause: Rarely used but powerful when you need to calculate things like rolling averages or sums over a defined number of rows.

Let’s look at a practical example for a bank that wants to calculate the rolling 30-day balance for each customer’s account:

SELECT customer_id, transaction_date, 
SUM(amount) OVER (PARTITION BY customer_id ORDER BY transaction_date
ROWS BETWEEN 29 PRECEDING AND CURRENT ROW) AS rolling_balance
FROM transactions;

  • ROWS BETWEEN 29 PRECEDING AND CURRENT ROW defines a window of the current row plus the 29 rows before it. That is a 30-day window only if each customer has one transaction row per day; for true calendar days, use a date-based RANGE frame where your engine supports it, or pre-aggregate to daily balances first.
  • The result is a rolling sum of transaction amounts over that window for each customer, a common requirement in banking data analysis.

Common Table Expressions (CTEs): Your Best Friend for Complex Queries

CTEs are another key tool in advanced SQL, especially for interviews. A CTE allows you to define a temporary result set that can be referenced within the main query, making your code more readable and maintainable.

Syntax of a CTE:

WITH cte_name AS (
SELECT column1, column2
FROM table
WHERE condition
)
SELECT *
FROM cte_name
WHERE another_condition;

Let’s extend our banking example. Suppose you’re asked to identify customers who have made consecutive late payments. Instead of cluttering your query with subqueries, you can simplify it using a CTE:

WITH customer_payments AS (
SELECT customer_id, payment_date, payment_status,
LAG(payment_status) OVER (PARTITION BY customer_id ORDER BY payment_date) AS previous_payment_status
FROM payments
)
SELECT customer_id, COUNT(*) AS consecutive_late_payments
FROM customer_payments
WHERE payment_status = 'Late'
AND previous_payment_status = 'Late'
GROUP BY customer_id;

In this case, the LAG() function pulls each customer’s previous payment status in date order, and the outer query keeps only the rows where both the current and the previous payment are marked as “Late.” Note that the CTE must include all payments, not just the late ones; otherwise LAG would skip over on-time payments and the result would no longer reflect truly consecutive late payments. This pattern is a typical use case in risk management for banks.

When to Use CTEs vs. Subqueries vs. Temporary Tables

A common question that arises is when to use CTEs over subqueries or temporary tables. Here’s a quick rule of thumb:

  • CTEs: Ideal for improving readability and maintainability, especially in big data environments like Spark, Snowflake, or Trino.
  • Subqueries: Useful when you need a single scalar value, such as the total sum of loan amounts or average transaction size.
  • Temporary Tables: Best when you need to reuse intermediate results multiple times across queries, often improving performance in complex pipelines.

For example, if you’re working with millions of financial transactions and need to run multiple calculations, a temporary table could save significant processing time by storing intermediate results that are reused in other queries.
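Here is a minimal sketch of that pattern. Temporary-table syntax varies by engine; this uses the CREATE TEMPORARY TABLE ... AS form and DATE_TRUNC, which work in engines such as PostgreSQL and Snowflake (other platforms have close equivalents), and the table and column names are hypothetical:

-- Materialize the expensive aggregation once
CREATE TEMPORARY TABLE monthly_customer_summary AS
SELECT customer_id,
DATE_TRUNC('month', transaction_date) AS txn_month,
SUM(amount) AS total_amount,
COUNT(*) AS txn_count
FROM transactions
GROUP BY customer_id, DATE_TRUNC('month', transaction_date);

-- Reuse it in multiple downstream queries without recomputing
SELECT txn_month, AVG(total_amount) AS avg_monthly_spend
FROM monthly_customer_summary
GROUP BY txn_month;

SELECT customer_id
FROM monthly_customer_summary
GROUP BY customer_id
HAVING SUM(txn_count) > 100;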


Mastering window functions and CTEs is your secret weapon in SQL interviews. These tools allow you to handle complex tasks like ranking transactions, calculating rolling balances, and identifying consecutive events — skills that will set you apart from other candidates.

By focusing on these advanced SQL techniques and understanding when to apply them, you’ll not only excel in interviews but also be prepared for the challenges you’ll face in real-world data analysis.


Ace Your Data Engineering Interviews: A 6-Month Plan for Engineers and Managers

This article addresses the question, “If I want to prepare today, what should I do?” It offers a 6-month roadmap for aspiring and seasoned Data Engineers or Data Engineering Managers, including course recommendations. Keep in mind that the courses are not mandatory, and you should choose based on your availability and interest.

1. Pick Your Cloud Platform (AWS, Azure, GCP)

  • Duration: 60 days
  • Start by choosing a cloud platform based on your experience and background. It’s important to cover all the data-related services offered by the platform and understand their use cases and best practices.
  • If you’re aiming for a managerial role, you should also touch on well-architected frameworks, particularly those related to staging, ingestion, orchestration, transformation, and visualization.
  • Key Advice: Always include a focus on security, especially when dealing with sensitive data.

Some Useful Resources:

Data Engineering on AWS — The complete training

Data Lake in AWS — Easiest Way to Learn [2024]

Migration to AWS

Optional: Consider taking a Pluralsight Skill IQ or Role IQ test to assess where you stand in your knowledge journey at this stage. It’s a great way to identify areas where you need to focus more attention.

“Give me six hours to chop down a tree and I will spend the first four sharpening the axe.” — Abraham Lincoln

2. Master SQL and Data Structures & Algorithms (DSA)

  • Duration: 30 days
  • SQL is the bread and butter of Data Engineering. Ensure you’ve practiced medium to complex SQL scenarios, focusing on real-world problems.
  • Alongside SQL, cover basic DSA concepts relevant to Data Engineering. You don’t need to delve as deep as a full-stack developer, but understanding a few key areas is crucial.

Key DSA Concepts to Cover:

  • Arrays and Strings: How to manipulate and optimize these data structures.
  • Hashmaps: Essential for efficiently handling large data sets.
  • Linked Lists and Trees: Useful for understanding hierarchical data.
  • Basic Sorting and Searching Algorithms: To optimize data processing tasks.

Some Useful Resources:

SQL for Data Scientists, Data Engineers and Developers

50Days of DSA JavaScript Data Structures Algorithms LEETCODE

3. Deep Dive into Data Lake and Data Warehousing

  • Duration: 30 days
  • A thorough understanding of Data Lakes and Data Warehousing is vital. Start with Apache Spark, which can be implemented using Databricks. For Data Warehousing, choose a platform like Redshift, Snowflake, or BigQuery.
  • I recommend focusing on Databricks and Snowflake as they are cloud-agnostic and offer flexibility across platforms.
  • Useful Resources:

Practical Lakehouse Architecture: Designing and Implementing Modern Data Platforms at Scale

4. Build Strong Foundations in Data Modeling

“In God we trust, all others must bring data.” — W. Edwards Deming

  • Duration: 30 days
  • Data Modeling is critical for designing efficient and scalable data systems. Focus on learning and practicing dimensional data models.
  • Useful Resources:

Data Modeling with Snowflake: A practical guide to accelerating Snowflake development using universal data modeling techniques

5. System Design and Architecture

“The best way to predict the future is to create it.” — Peter Drucker

  • Duration: 30 days
  • System design is an advanced topic that often comes up in interviews, especially for managerial roles. Re-design a large-scale project you’ve worked on and improve it based on well-architected principles.
  • Key Advice: Refer to Amazon customer case studies and engineering blogs from leading companies to make necessary changes to your architecture.
  • Useful Resources:

System Design Primer on GitHub

Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems

Amazon Architecture Blog

6. Fine-Tune Your Resume and Prepare STAR Stories

“Opportunities don’t happen. You create them.” — Chris Grosser

  • Duration: 15 days
  • Now that you have built up your skills, it’s time to work on your resume. Highlight your accomplishments using the STAR method, focusing on customer-centric stories that showcase your experience.
  • Keep actively searching for jobs but avoid cold applications. Instead, try to connect with someone who can help you with a referral.

7. Utilize Referrals & LinkedIn Contacts

“Your network is your net worth.” — Porter Gale

Building connections and networking is crucial in landing a good job. Utilize LinkedIn and other platforms to connect with industry professionals. Remember to research the company thoroughly and understand their strengths, weaknesses, and key technologies before interviews.

  • Always tailor your job applications and resumes to the specific company and role.
  • Utilize your connections to gain insights and possibly a referral, which significantly increases your chances of getting hired.

8. Always Stay Prepared, Even If You’re Not Looking to Move

“Luck is what happens when preparation meets opportunity.” — Seneca

Even if you’re working somewhere and not planning to change jobs, it’s wise to stay prepared. Workplace politics can overshadow skills, empathy can be in short supply, and self-preservation often takes precedence over the team or its most capable people. It’s important to always be ready to seize new opportunities if they arise.

This roadmap offers a structured approach to mastering the necessary skills for Data Engineering and Data Engineering Manager roles within six months. It’s designed to be flexible — feel free to adjust the timeline based on your current experience and availability. Remember, the key to success lies in consistent practice, continuous learning, and proactive networking.

“The only limit to our realization of tomorrow is our doubts of today.” — Franklin D. Roosevelt

Good luck and best wishes in achieving your career goals!


Want to Get Better at System Design Interviews? Here’s How to Prepare

System design interviews can be daunting due to their complexity and the vast knowledge required to excel. Whether you’re a recent graduate or a seasoned engineer, preparing for these interviews necessitates a well-thought-out strategy and access to the right resources. In this article, I’ll guide you to navigate the system design landscape and equip you to succeed in your upcoming interviews.

Start with the Basics

“Web Scalability for Startup Engineers” by Artur Ejsmont — a recommended starting point for beginners in system design.

“Designing Data-Intensive Applications” by Martin Kleppmann — a more in-depth resource for those with a basic understanding of system design.

It’s essential to establish a strong foundation before delving too deep into a subject. For beginners, “Web Scalability for Startup Engineers” is an excellent resource. It covers the basics and prepares you for more advanced concepts. After mastering the fundamentals, “Designing Data-Intensive Applications” by Martin Kleppmann will guide you further into data systems.

Microservices and Domain-Driven Design

“Building Microservices” by Sam Newman — Focuses on microservices architecture and its implications in system design.

Once you are familiar with the fundamentals, the next step is to explore the intricacies of the microservices architectural style through “Building Microservices.” To gain a deeper understanding of practical patterns and design principles, “Microservices Patterns and Best Practices” is an excellent resource. Lastly, for those who wish to understand the philosophy behind system architecture, “Domain-Driven Design” is a valuable read.

API Design and gRPC

“RESTful Web APIs” by Leonard Richardson, Mike Amundsen, and Sam Ruby provides a comprehensive guide to developing web-based APIs that adhere to the REST architectural style.

Today, APIs are the connective tissue of the internet. If you intend to design effective APIs, a good starting point is “RESTful Web APIs” by Leonard Richardson and his colleagues. If you are exploring the Remote Procedure Call (RPC) style, particularly gRPC, “gRPC: Up and Running” is a comprehensive guide.

Preparing for the Interview

“System Design Interview — An Insider’s Guide” by Alex Xu is an essential book for those preparing for challenging system design interviews.

It offers a comprehensive look at the strategies and thought processes required to navigate these complex discussions. Although it is one of many resources candidates will need, the book is tailored to equip them with the means to dissect and approach real interview questions. The book blends technical knowledge with the all-important communicative skills, preparing candidates to think on their feet and articulate clear and effective system design solutions. Xu’s guide demystifies the interview experience, providing a rich set of examples and insights to help candidates prepare for the interview process.

Domain-Specific Knowledge

Enhance your knowledge in your domain with books such as “Kafka: The Definitive Guide” for Distributed Messaging and “Cassandra: The Definitive Guide” for understanding wide-column stores. “Designing Event-Driven Systems” is crucial for grasping event sourcing and services using Kafka.

General Product Design

Pay attention to product design in system design. Books like “The Design of Everyday Things” and “Hooked: How to Build Habit-Forming Products” teach user-centric design principles, which are increasingly crucial in system design.

Online Resources

The internet is a goldmine of information. You can watch tech conference talks, follow YouTube channels such as Gaurav Sen’s System Design Interview and read engineering blogs from companies like Uber, Netflix, and LinkedIn.


System design is an iterative learning process that blends knowledge, curiosity, and experience. The resources provided here are a roadmap to guide you through this journey. With the help of these books and resources, along with practice and reflection, you will be well on your way to mastering system design interviews. Remember, it’s not just about understanding system design but also about thinking like a system designer.


AWS-Based URL Shortener: Design, Logic, and Scalability

Here’s a behind-the-scenes look at creating a URL-shortening service using Amazon Web Services (AWS).

Users and System Interaction:

  • User Requests: Users submit a long web address to get a shorter version, use an existing short link to reach the original website, or ask for a short link to be removed.
  • API Gateway: This is AWS’s reception. It directs user requests to the right service inside AWS.
  • Lambda Functions: These are the workers. They perform tasks like making a link shorter, retrieving the original from a short link, or deleting a short link.
  • DynamoDB: This is the storage room. All the long and short web addresses are stored here.
  • ElastiCache: Before heading to DynamoDB, the system checks here first when users access a short link. It’s faster.
  • VPC & Subnets: This is the AWS structure. The welcoming part (API Gateway) is public, while sensitive data (DynamoDB) is kept private and secure.

Making Links Shorter for Users:

  • Sequential Counting: Every web link gets a unique number. To keep it short, that number is converted into a combination of letters and numbers.
  • Hashing: The system also shortens the long web address into a fixed-length string. This method can occasionally produce the same result for two different links (a collision), but the system detects and resolves these cases.

Sequential Counting: This takes a long URL as input and uses a unique counter value from the database to generate a short URL.

For instance, the URL https://example.com/very-long-url might be shortened to https://short.url/1234AB using a unique number from the database, then converting this number into a mix of letters and numbers.
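To make the conversion concrete, suppose the alphabet is the common ordering 0-9, A-Z, a-z (62 symbols with values 0 through 61; this ordering is just an illustrative assumption, not something the design above mandates). The counter value 1234 then converts to base 62 as follows:

$$1234 = 19 \times 62 + 56 \;\rightarrow\; (19,\, 56) \;\rightarrow\; \text{"Ju"}$$

So counter 1234 would yield a short link like https://short.url/Ju. Larger counter values simply produce more base-62 digits, which is why six characters go such a long way (see the estimate at the end of this article).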

Hashing: This involves taking a long URL and converting it to a fixed-size string of characters using a hashing algorithm. So, https://example.com/very-long-url could become https://short.url/h5Gk9.

The Rationale for Combining:

  1. Enhanced Uniqueness & Collision Handling: Sequential counting ensures uniqueness, and in the unlikely event of a hashing collision, the sequential identifier can be used as a fallback or combined with the hash.
  2. Balancing Predictability & Compactness: Hashing gives compact URLs, and by adding a sequential component, we reduce predictability.
  3. Scalability & Performance: Sequential lookups are faster. If the hash table grows large, the performance could degrade due to hash collisions. Combining with sequential IDs ensures fast retrievals.

Lambda Function for Shortening (PUT Request)

  1. Input: Long URL, e.g. “https://www.example.com/very-long-url”
  2. URL Exists: If the long URL has been shortened before, return the stored short code, e.g. “abcd12”, and stop.
  3. Hash URL: Output, e.g. “a1b2c3”
  4. Assign Number: Unique sequential number, e.g. “456”
  5. Combine Hash & Number: e.g. “a1b2c3456”
  6. Store in DynamoDB: {“https://www.example.com/very-long-url”: “a1b2c3456”}
  7. Update ElastiCache: {“a1b2c3456”: “https://www.example.com/very-long-url”}
  8. Return to API Gateway: Shortened URL, e.g. “a1b2c3456”

Lambda Function for Redirecting (GET Request)

  • Input: The user provides a short URL like “a1b2c3456”.
  • Check ElastiCache: The system first looks up the short URL in ElastiCache.
  • Cache Hit: If the long URL is found in the cache, the system retrieves it directly.
  • Cache Miss: If it is not in the cache, the system falls back to DynamoDB.
  • Check DynamoDB: The system searches DynamoDB for the corresponding long URL.
  • URL Found: The long URL matching the given short URL is found, e.g. “https://www.example.com/very-long-url”.
  • Update ElastiCache: System updates the cache with {“a1b2c3456”: “https://www.example.com/very-long-url”}.
  • Return to API Gateway: The system redirects users to the original Long URL.

Lambda Function for Deleting (DELETE Request)

  • Input: The user provides a short URL they want to delete.
  • Check DynamoDB: The system looks up the short URL in DynamoDB.
  • URL Found: If the URL mapping for the short URL is found, it proceeds to deletion.
  • Delete from DynamoDB: The system deletes the URL mapping from DynamoDB.
  • Clear from ElastiCache: The system also clears the URL mapping from the cache so that the short URL no longer redirects users.
  • Return Confirmation to API Gateway: After the deletion succeeds, a confirmation is sent back through the API Gateway to let the user know the link has been removed.

Simple Math Behind Our URL Shortening (Back-of-the-Envelope Estimation):

When we use a 6-character mix of letters (both lowercase and uppercase) and digits for our short URLs, we have about 56.8 billion possible combinations. If users create 100 million short links every day, we can keep generating unique links for over 500 days without repeating one.
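Working that arithmetic out explicitly, under the same assumptions (6 characters drawn from a 62-symbol alphabet, a constant 100 million new links per day):

$$62^{6} = 56{,}800{,}235{,}584 \approx 5.68 \times 10^{10} \text{ combinations}$$

$$\frac{5.68 \times 10^{10} \text{ combinations}}{10^{8} \text{ links/day}} \approx 568 \text{ days}$$

That is where the “over 500 days” figure comes from; moving to 7 characters multiplies the key space by another 62, which buys roughly a century at the same rate.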
