Monthly Archives: November 2023

Apache Spark 101: Window Functions

The process begins with data partitioning to enable distributed processing, followed by shuffling and sorting the data within partitions and executing partition-level operations. Spark employs a lazy evaluation strategy that postpones the actual computation until an action triggers it. At that point, the Catalyst Optimizer refines the execution plan, including optimizations specific to window functions, and the Tungsten Execution Engine executes the plan, optimizing memory and CPU usage to complete the process.

Apache Spark offers a robust collection of window functions, allowing users to conduct intricate calculations and analysis over a set of input rows. These functions improve the flexibility of Spark’s SQL and DataFrame APIs, simplifying the execution of advanced data manipulation and analytics.

Understanding Window Functions

Window functions in Spark allow users to perform calculations across a set of rows related to the current row. These functions operate on a window of input rows and are particularly useful for ranking, aggregation, and accessing data from adjacent rows without using self-joins.

Common Window Functions

Rank:

  • Assigns a ranking within a window partition, with gaps for tied values.
  • If two rows are tied for rank 1, the next rank will be 3, reflecting the ties in the sequence.

Advantages: Handles ties and provides a clear ranking context.

Disadvantages: Gaps may cause confusion; less efficient on larger datasets due to ranking calculations.

Dense Rank:

  • Operates like rank but without gaps in the ranking order.
  • Tied rows receive the same rank, and the subsequent rank number is consecutive.

Advantages: No gaps in ranking, continuous sequence.

Disadvantages: Less distinction in ties; can be computationally intensive.

Row Number:

  • Gives a unique sequential identifier to each row in a partition.
  • It does not account for ties, as each row is given a distinct number.

Advantages: Assigns a unique identifier; generally faster than rank and dense_rank.

Disadvantages: No tie handling; sensitive to order, which can affect the outcome.

Lead:

  • Provides access to data in subsequent rows, which is useful for comparisons with the current row.
  • lead(column, 1) returns the value of column from the next row.

Lag:

  • Retrieves data from the previous row, allowing for retrospective comparison.
  • lag(column, 1) fetches the value of column from the preceding row.

Advantages: Allows for forward and backward analysis; the number of rows to look ahead or back can be specified.

Disadvantages: Returns null for rows without a subsequent or previous row; results depend on row order.

These functions are typically used with the OVER clause to define the specifics of the window, such as partitioning and ordering of the rows.

Here’s a simple example to illustrate:

from pyspark.sql import SparkSession
from pyspark.sql.window import Window
import pyspark.sql.functions as F

# Initialize Spark session
spark = SparkSession.builder.appName("windowFunctionsExample").getOrCreate()

# Sample data
data = [("Alice", "Sales", 3000),
        ("Bob", "Sales", 4600),
        ("Charlie", "Finance", 5200),
        ("David", "Sales", 3000),
        ("Edward", "Finance", 4100)]

# Create DataFrame
columns = ["EmployeeName", "Department", "Salary"]
df = spark.createDataFrame(data, columns)

# Define window specification
windowSpec = Window.partitionBy("Department").orderBy("Salary")

# Apply window functions
df.withColumn("rank", F.rank().over(windowSpec)) \
  .withColumn("dense_rank", F.dense_rank().over(windowSpec)) \
  .withColumn("row_number", F.row_number().over(windowSpec)) \
  .withColumn("lead", F.lead("Salary", 1).over(windowSpec)) \
  .withColumn("lag", F.lag("Salary", 1).over(windowSpec)) \
  .show()
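
The same window logic can be expressed in Spark SQL, where the OVER clause appears explicitly. A minimal sketch, assuming the DataFrame above is registered as a temporary view named employees:

df.createOrReplaceTempView("employees")

spark.sql("""
    SELECT EmployeeName, Department, Salary,
           RANK()          OVER (PARTITION BY Department ORDER BY Salary) AS rank,
           DENSE_RANK()    OVER (PARTITION BY Department ORDER BY Salary) AS dense_rank,
           ROW_NUMBER()    OVER (PARTITION BY Department ORDER BY Salary) AS row_number,
           LEAD(Salary, 1) OVER (PARTITION BY Department ORDER BY Salary) AS lead,
           LAG(Salary, 1)  OVER (PARTITION BY Department ORDER BY Salary) AS lag
    FROM employees
""").show()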

Common Elements in Spark’s Architecture for Window Functions:

Lazy Evaluation:

Spark’s transformation operations, including window functions, are lazily evaluated. This means the actual computation happens only when an action (like collect or show) is called. This approach allows Spark to optimize the execution plan.

Lazy evaluation postpones computations until they are required, which lets Spark optimize the execution plan and avoid unnecessary processing. This is particularly beneficial for window functions, where complex calculations run over a range of input rows: lazy evaluation reduces redundant computation, enables optimized query plans, minimizes data movement, and allows pipelining for efficient task scheduling.
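
To make this concrete, here is a minimal sketch reusing df, windowSpec, and F from the example above: defining the window column only builds a plan, and the job runs when an action is called.

# Transformation only: this builds a logical plan, no Spark job runs yet.
ranked = df.withColumn("rank", F.rank().over(windowSpec))

# The action triggers optimization by Catalyst and execution by Tungsten.
ranked.show()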

Catalyst Optimizer:

The Catalyst Optimizer applies a series of optimization techniques to enhance query execution time, some particularly relevant to window functions. These optimizations include but are not limited to:

  • Predicate Pushdown: This optimization pushes filters and predicates closer to the data source, reducing the amount of unnecessary data that needs to be processed. When applied to window functions, predicate pushdown can optimize data filtering within the window, leading to more efficient processing.
  • Column Pruning: It eliminates unnecessary columns from being read or loaded during query execution, reducing I/O and memory usage. This optimization can be beneficial when working with window functions, as it minimizes the amount of data that needs to be processed within the window.
  • Constant Folding: This optimization identifies and evaluates constant expressions during query analysis, reducing unnecessary computations during query execution. While not directly related to window functions, constant folding contributes to overall query efficiency.
  • Cost-Based Optimization: It leverages statistics and cost models to estimate the cost of different query plans and selects the most efficient plan based on these estimates. This optimization can help select the most efficient execution plan for window function queries.

The Catalyst Optimizer also performs code generation, producing efficient Java bytecode for executing the query. This further improves performance by leveraging the optimizations provided by the underlying execution engine.

In the context of window functions, the Catalyst Optimizer aims to optimize the processing of window operations, including ranking, aggregation, and data access within the window. By applying these optimization techniques, the Catalyst Optimizer contributes to improved performance and efficient execution of Spark SQL queries involving window functions.
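
To observe these optimizations, the explain method prints the plans Catalyst produces. A small sketch, reusing df, windowSpec, and F from the earlier example; the exact plan text varies by Spark version:

# A filter on the partitioning column can be pushed below the Window
# operator (predicate pushdown), so only the Sales partition is ranked.
(df.withColumn("rank", F.rank().over(windowSpec))
   .filter(F.col("Department") == "Sales")
   .explain(True))  # prints parsed, analyzed, optimized, and physical plans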

Tungsten Execution Engine:

The Tungsten Execution Engine is designed to optimize Spark jobs for CPU and memory efficiency, focusing on the hardware architecture of Spark’s platform. By leveraging off-heap memory management, cache-aware computation, and whole-stage code generation, Tungsten aims to substantially reduce the usage of JVM objects, improve cache locality, and generate efficient code for accessing memory structures.

Integrating the Tungsten Execution Engine with the Catalyst Optimizer allows Spark to handle complex calculations and analysis efficiently, including those involving window functions. This leads to improved performance and optimized data processing.

In the context of window functions, the Catalyst Optimizer applies a series of transformations to produce an optimized physical plan from the logical plan. The Tungsten Execution Engine then uses this physical plan, together with the Whole-Stage Codegen functionality introduced in Spark 2.0, to generate efficient code that resembles hand-written code.
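
As a rough illustration, assuming Spark 3.0 or later (where explain accepts a mode argument), the generated whole-stage code can be inspected directly; operators fused into a single generated function are marked with an asterisk in the physical plan:

# Show the Java code produced by Whole-Stage CodeGen for each fused
# pipeline of operators in the window query (output is version-dependent).
df.withColumn("rank", F.rank().over(windowSpec)).explain(mode="codegen")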

Window functions within Spark’s architecture involve several stages:

Data Partitioning:

  • Data is divided into partitions for parallel processing across cluster nodes.
  • For window functions, partitioning is typically based on the columns passed to partitionBy (see the sketch after these stages).

Shuffle and Sort:

  • Spark may shuffle data to ensure all rows for a partition are on the same node.
  • Data is then sorted within each partition to prepare for rank calculations.

Rank Calculation and Ties Handling:

  • Ranks are computed within each partition, allowing nodes to process data independently.
  • Ties are managed during sorting, which is made efficient by Spark’s partitioning mechanism.

Lead and Lag Operations:

  • These functions work row-wise within partitions, processing rows in the context of their neighbours.
  • Data locality within partitions minimizes network data transfer, which is crucial for performance.

Execution Plan Optimization:

  • Spark employs lazy evaluation, triggering computations only upon an action request.
  • The Catalyst Optimizer refines the execution plan, aiming to minimize data shuffles.
  • The Tungsten Execution Engine optimizes memory and CPU resources, enhancing the performance of window function calculations.
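
The sketch below, reusing the earlier DataFrame, contrasts a partitioned window with an unpartitioned one to make the shuffle behaviour visible; it is illustrative rather than prescriptive:

# Partitioned window: rows are shuffled by Department and sorted by Salary
# within each partition, so ranks are computed independently per department.
by_dept = Window.partitionBy("Department").orderBy("Salary")

# Unpartitioned window: all rows end up in a single partition; Spark
# typically logs a warning because this defeats parallelism on large data.
global_order = Window.orderBy("Salary")

df.withColumn("dept_rank", F.rank().over(by_dept)) \
  .withColumn("global_rank", F.rank().over(global_order)) \
  .show()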

Apache Spark window functions are a powerful tool for advanced data manipulation and analysis. They enable efficient ranking, aggregation, and access to adjacent rows without complex self-joins. By using these functions effectively, users can unlock the full potential of Apache Spark for analytical and processing needs and derive valuable insights from their data.


AWS 101: Implementing IAM Roles for Enhanced Developer Access with Assume Role Policy

Setting up and using an IAM role in AWS involves three steps. Firstly, the user creates an IAM role and defines its trust relationships using an AssumeRole policy. Secondly, the user attaches an IAM-managed policy to the role, which specifies the permissions that the role has within AWS. Finally, the role is assumed through the AWS Security Token Service (STS), which grants temporary security credentials for accessing AWS services. This cycle of trust and permission granting, from user action to AWS STS and back, underpins secure AWS operations.

IAM roles are crucial for access management in AWS. This article provides a step-by-step walkthrough for creating a user-specific IAM role, attaching necessary policies, and validating for security and functionality.

Step 1: Compose a JSON file named assume-role-policy.json.

This policy explicitly defines the trusted entities that can assume the role, effectively safeguarding it against unauthorized access.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "PRINCIPAL_ARN"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

Before use, replace PRINCIPAL_ARN in this policy snippet with the actual ARN of the user or service that needs to assume the role. The ARN can be obtained programmatically, as shown in the next step.

Step 2: Establishing the IAM Role via AWS CLI

The CLI is a direct and scriptable interface for AWS services, facilitating efficient role creation and management.

# Retrieve the ARN for the current user and store it in a variable
PRINCIPAL_ARN=$(aws sts get-caller-identity --query Arn --output text)

# Replace the placeholder in the policy template and create the actual policy
sed -i "s|PRINCIPAL_ARN|$PRINCIPAL_ARN|g" assume-role-policy.json

# Create the IAM role with the updated assume role policy
aws iam create-role --role-name DeveloperRole \
--assume-role-policy-document file://assume-role-policy.json \
--query 'Role.Arn' --output text

This command sequence fetches the user’s ARN, substitutes it into the policy document, and then creates the role DeveloperRole with the updated policy.

Step 3: Link the ‘PowerUserAccess’ managed policy to the newly created IAM role.

This policy confers essential permissions for a broad range of development tasks while adhering to the principle of least privilege by excluding full administrative privileges.

# Attach the 'PowerUserAccess' policy to the 'DeveloperRole'
aws iam attach-role-policy --role-name DeveloperRole \
--policy-arn arn:aws:iam::aws:policy/PowerUserAccess

The command attaches the necessary permissions to the DeveloperRole without conferring overly permissive access.

Assuming the IAM Role

Assume the IAM role to procure temporary security credentials. Assuming a role with temporary credentials minimizes security risks compared to using long-term access keys and confines access to a session’s duration.

# Assume the 'DeveloperRole' and specify the MFA device serial number and token code
aws sts assume-role --role-arn ROLE_ARN \
--role-session-name DeveloperSession \
--serial-number MFA_DEVICE_SERIAL_NUMBER \
--token-code MFA_TOKEN_CODE

The command now includes parameters for MFA, enhancing security. Replace ROLE_ARN with the role’s ARN, MFA_DEVICE_SERIAL_NUMBER with the serial number of the MFA device, and MFA_TOKEN_CODE with the current MFA code.

Validation Checks

Execute commands to verify the permissions of the IAM role.

Validation is essential to confirm that the role possesses the correct permissions and is operative as anticipated.

List S3 Buckets:

# List S3 buckets using the assumed role's credentials
aws s3 ls --profile DeveloperSessionCredentials

This checks the ability to list S3 buckets, verifying that S3-related permissions are correctly granted to the role.

Describe EC2 Instances:

# Describe EC2 instances using the assumed role's credentials
aws ec2 describe-instances --profile DeveloperSessionCredentials

This validates the role’s permissions to view details about EC2 instances.

Attempt a Restricted Action:

# Try listing IAM users, which should be outside the 'PowerUserAccess' policy scope
aws iam list-users --profile DeveloperSessionCredentials

This command should fail, reaffirming that the role does not have administrative privileges.

Note: Replace --profile DeveloperSessionCredentials with the actual AWS CLI profile that has been configured with the assumed role’s credentials. To set up the profile with the new temporary credentials, you’ll need to update your AWS credentials file, typically located at ~/.aws/credentials.
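
As a minimal sketch (the profile name matches the validation commands above, and the MFA flags from the earlier assume-role call are omitted for brevity), the temporary credentials can be written into that profile with aws configure set:

# Fetch temporary credentials for the role (MFA flags omitted for brevity)
CREDS=$(aws sts assume-role --role-arn ROLE_ARN --role-session-name DeveloperSession \
  --query 'Credentials.[AccessKeyId,SecretAccessKey,SessionToken]' --output text)
read -r KEY_ID SECRET TOKEN <<< "$CREDS"

# Write them into the profile referenced by the validation commands above
aws configure set aws_access_key_id "$KEY_ID" --profile DeveloperSessionCredentials
aws configure set aws_secret_access_key "$SECRET" --profile DeveloperSessionCredentials
aws configure set aws_session_token "$TOKEN" --profile DeveloperSessionCredentials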


Developers can securely manage AWS resources by creating an IAM role with scoped privileges. This involves meticulously validating the permissions of the role. Additionally, the role assumption process can be fortified with MFA to ensure an even higher level of security.


Apache Spark 101: Understanding Spark Code Execution

Apache Spark is a powerful distributed data processing engine widely used in big data and machine learning applications. Thanks to its enriched API and robust data structures such as DataFrames and Datasets, it offers a higher level of abstraction than traditional map-reduce jobs.

Spark Code Execution Journey

Parsing

  • Spark SQL queries or DataFrame API methods are parsed into an unresolved logical plan.
  • The parsing step converts the code into a Spark-understandable format without checking table or column existence.

Analysis

  • The unresolved logical plan undergoes analysis by the Catalyst optimizer, Spark’s optimization framework.
  • This phase confirms the existence of tables, columns, and functions, resulting in a resolved logical plan where all references are validated against the catalogue schema.

Logical Plan Optimization

  • The Catalyst optimizer applies optimization rules to the resolved logical plan, potentially reordering joins, pushing down predicates, or combining filters, creating an optimized logical plan.

Physical Planning

  • The optimized logical plan is transformed into one or more physical plans, outlining the execution strategy and order of operations like map, filter, and join.

Cost Model

  • Spark evaluates these physical plans using a cost model, selecting the most efficient one based on data sizes and distribution heuristics (a sketch of enabling statistics collection for the cost model follows this walkthrough).

Code Generation

  • Once the final physical plan is chosen, Spark employs WholeStage CodeGen to generate optimized Java bytecode that will run on the executors, minimizing JVM calls and optimizing execution.

Execution

  • The bytecode is distributed to executors across the cluster for execution, with tasks running in parallel, processing data in partitions, and producing the final output.

The Catalyst optimizer is integral throughout these steps, enhancing the performance of Spark SQL queries and DataFrame operations using rule-based and cost-based optimization.
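
The cost-based path depends on statistics being available. A hedged sketch, assuming employees is a persisted table and reusing an existing SparkSession named spark:

# Cost-based optimization consumes table and column statistics; it is
# disabled by default in many Spark versions.
spark.conf.set("spark.sql.cbo.enabled", "true")

# Collect the statistics the cost model uses when comparing physical plans.
spark.sql("ANALYZE TABLE employees COMPUTE STATISTICS FOR COLUMNS age, dep_id")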

Example Execution Plan

Consider a SQL query that joins two tables and filters and aggregates the data:

SELECT department, COUNT(*)
FROM employees
JOIN departments ON employees.dep_id = departments.id
WHERE employees.age > 30
GROUP BY department

The execution plan may follow these steps:

  • Parsed Logical Plan: The initial SQL command is parsed into an unresolved logical plan.
  • Analyzed Logical Plan: The plan is analyzed and resolved against the table schemas.
  • Optimized Logical Plan: The Catalyst optimizer optimizes the plan.
  • Physical Plan: A cost-effective physical plan is selected.
  • Execution: The physical plan is executed across the Spark cluster.


The execution plan for the given SQL query in Apache Spark involves several stages, from logical planning to physical execution. Here’s a simplified breakdown:

Parsed Logical Plan: Spark parses the SQL query into an initial logical plan. This plan is unresolved as it only represents the structure of the query without checking the existence of the tables or columns.

'Project ['department]
+- 'Aggregate ['department], ['department, 'COUNT(1)]
   +- 'Filter ('age > 30)
      +- 'Join Inner, ('employees.dep_id = 'departments.id)
         :- 'UnresolvedRelation `employees`
         +- 'UnresolvedRelation `departments`

Analyzed Logical Plan: The parsed logical plan is analyzed against the database catalogue. This resolves table and column names and checks for invalid operations or data types.

Project [department#123]
+- Aggregate [department#123], [department#123, COUNT(1) AS count#124]
   +- Filter (age#125 > 30)
      +- Join Inner, (dep_id#126 = id#127)
         :- SubqueryAlias employees
         :  +- Relation[age#125,dep_id#126] parquet
         +- SubqueryAlias departments
            +- Relation[department#123,id#127] parquet

Optimized Logical Plan: The Catalyst optimizer applies a series of rules to the logical plan to optimize it. It may reorder joins, push down filters, and perform other optimizations.

Aggregate [department#123], [department#123, COUNT(1) AS count#124]
+- Project [department#123, age#125]
   +- Join Inner, (dep_id#126 = id#127)
      :- Filter (age#125 > 30)
      :  +- Relation[age#125,dep_id#126] parquet
      +- Relation[department#123,id#127] parquet

Physical Plan: Spark generates one or more physical plans from the logical plan. It then uses a cost model to choose the most efficient physical plan for execution.

*(3) HashAggregate(keys=[department#123], functions=[count(1)], output=[department#123, count#124])
+- Exchange hashpartitioning(department#123, 200)
   +- *(2) HashAggregate(keys=[department#123], functions=[partial_count(1)], output=[department#123, count#125])
      +- *(2) Project [department#123, age#125]
         +- *(2) BroadcastHashJoin [dep_id#126], [id#127], Inner, BuildRight
            :- *(2) Filter (age#125 > 30)
            :  +- *(2) ColumnarToRow
            :     +- FileScan parquet [age#125,dep_id#126]
            +- BroadcastExchange HashedRelationBroadcastMode(List(input[1, int, false]))
               +- *(1) ColumnarToRow
                  +- FileScan parquet [department#123,id#127]

Code Generation: Spark generates Java bytecode for the chosen physical plan to run on each executor. This process is known as WholeStage CodeGen.

Execution: The bytecode is sent to Spark executors distributed across the cluster. Executors run the tasks in parallel, processing the data in partitions.

During execution, tasks are executed within stages, and stages may have shuffle boundaries where data is redistributed across the cluster. The Exchange hashpartitioning indicates a shuffle operation due to the GROUP BY clause.
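
To reproduce plans like those above, assuming the employees and departments tables are registered in the catalogue (for example, as Parquet-backed tables) and a SparkSession named spark is available, explain prints every stage:

query = """
SELECT department, COUNT(*)
FROM employees
JOIN departments ON employees.dep_id = departments.id
WHERE employees.age > 30
GROUP BY department
"""

# extended=True prints the parsed, analyzed, and optimized logical plans
# followed by the selected physical plan.
spark.sql(query).explain(extended=True)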


Want to Get Better at System Design Interviews? Here’s How to Prepare

System design interviews can be daunting due to their complexity and the vast knowledge required to excel. Whether you’re a recent graduate or a seasoned engineer, preparing for these interviews necessitates a well-thought-out strategy and access to the right resources. In this article, I’ll guide you to navigate the system design landscape and equip you to succeed in your upcoming interviews.

Start with the Basics

“Web Scalability for Startup Engineers” by Artur Ejsmont — This book is recommended as a starting point for beginners in system design.

“Designing Data-Intensive Applications” by Martin Kleppmann offers a more in-depth treatment for those with a basic understanding of system design.

It’s essential to establish a strong foundation before delving too deep into a subject. For beginners, “Web Scalability for Startup Engineers” is an excellent resource. It covers the basics and prepares you for more advanced concepts. After mastering the fundamentals, “Designing Data-Intensive Applications” by Martin Kleppmann will guide you further into data systems.

Microservices and Domain-Driven Design

“Building Microservices” by Sam Newman — Focuses on microservices architecture and its implications in system design.

Once you are familiar with the fundamentals, the next step is to explore the intricacies of the microservices architectural style through “Building Microservices.” To gain a deeper understanding of practical patterns and design principles, “Microservices Patterns and Best Practices” is an excellent resource. Lastly, for those who wish to understand the philosophy behind system architecture, “Domain-Driven Design” is a valuable read.

API Design and gRPC

“RESTful Web APIs” by Leonard Richardson, Mike Amundsen, and Sam Ruby provides a comprehensive guide to developing web-based APIs that adhere to the REST architectural style.

In today’s world, APIs are the connective tissue of the internet. If you intend to design effective APIs, a good starting point is “RESTful Web APIs” by Leonard Richardson and his colleagues. If you are exploring the Remote Procedure Call (RPC) genre, particularly gRPC, then “gRPC: Up and Running” is a comprehensive guide.

Preparing for the Interview

“System Design Interview — An Insider’s Guide” by Alex Xu is an essential book for those preparing for challenging system design interviews.

It offers a comprehensive look at the strategies and thought processes required to navigate these complex discussions. Although it is one of many resources candidates will need, the book is tailored to equip them with the means to dissect and approach real interview questions. The book blends technical knowledge with the all-important communicative skills, preparing candidates to think on their feet and articulate clear and effective system design solutions. Xu’s guide demystifies the interview experience, providing a rich set of examples and insights to help candidates prepare for the interview process.

Domain-Specific Knowledge

Enhance your knowledge in your domain with books such as “Kafka: The Definitive Guide” for Distributed Messaging and “Cassandra: The Definitive Guide” for understanding wide-column stores. “Designing Event-Driven Systems” is crucial for grasping event sourcing and services using Kafka.

General Product Design

Pay attention to product design in system design. Books like “The Design of Everyday Things” and “Hooked: How to Build Habit-Forming Products” teach user-centric design principles, which are increasingly crucial in system design.

Online Resources

The internet is a goldmine of information. You can watch tech conference talks, follow YouTube channels such as Gaurav Sen’s System Design Interview, and read engineering blogs from companies like Uber, Netflix, and LinkedIn.


System design is an iterative learning process that blends knowledge, curiosity, and experience. The resources provided here are a roadmap to guide you through this journey. With the help of these books and resources, along with practice and reflection, you will be well on your way to mastering system design interviews. Remember, it’s not just about understanding system design but also about thinking like a system designer.


Optimizing Cloud Banking Service: Service Mesh for Secure Microservices Integration

As cloud computing continues to evolve, microservices architectures are becoming increasingly complex. To effectively manage this complexity, service meshes are being adopted. In this article, we will explain what a service mesh is, why it is necessary for modern cloud architectures, and how it addresses some of the most pressing challenges developers face today.

Understanding the Service Mesh

A service mesh is a configurable infrastructure layer built into an application that allows for the facilitation of flexible, reliable, and secure communications between individual service instances. Within a cloud-native environment, especially one that embraces containerization, a service mesh is critical in handling service-to-service communications, allowing for enhanced control, management, and security.

Why a Service Mesh?

As applications grow and evolve into distributed systems composed of many microservices, they often encounter challenges in service discovery, load balancing, failure recovery, security, and observability. A service mesh addresses these challenges by providing:

  • Dynamic Traffic Management: Adjusting the flow of requests and responses to accommodate changes in the infrastructure.
  • Improved Resiliency: Adding robustness to the system with patterns like retries, timeouts, and circuit breakers.
  • Enhanced Observability: Offering tools for monitoring, logging, and tracing to understand system performance and behaviour.
  • Security Enhancements: Ensuring secure communication through encryption and authentication protocols.

By implementing a service mesh, these distributed and loosely coupled applications can be managed more effectively, ensuring operational efficiency and security at scale.

Foundational Elements: Service Discovery and Proxies

The service mesh relies on two essential components: Consul and Envoy. Consul is responsible for service discovery, keeping track of services, their locations, and their health status so the system can adapt to changes in the environment. Envoy, deployed alongside each service instance as a proxy, handles network communication and acts as an abstraction layer for traffic management and message routing.
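
As an illustrative sketch (the service name, port, and health-check endpoint here are hypothetical), a Consul agent service definition can register a service and request an Envoy sidecar through Consul Connect:

{
  "service": {
    "name": "bank-interface",
    "port": 8080,
    "connect": {
      "sidecar_service": {}
    },
    "check": {
      "http": "http://localhost:8080/health",
      "interval": "10s"
    }
  }
}

The sidecar itself would then typically be started with consul connect envoy -sidecar-for bank-interface, leaving routing and traffic management to Envoy as described above.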

Architectural Overview

The architecture consists of a Public and Private VPC setup, which encloses different clusters. The ‘LEFT_CLUSTER’ in the VPC is dedicated to critical services like logging and monitoring, which provide insights into the system’s operation and manage transactions. On the other hand, the ‘RIGHT_CLUSTER’ in the VPC contains services for Audit and compliance, Dashboards, and Archived Data, ensuring a robust approach to data management and regulatory compliance.

The diagram shows a service mesh architecture for sensitive banking operations in AWS. It comprises two clusters: the Left Cluster (VPC) includes a Mesh Gateway, Bank Interface, Authentication and Authorization systems, and a Reconciliation Engine; the Right Cluster (VPC) manages Audit, provides a Dashboard, stores Archived Data, and handles Notifications. Consul and Envoy proxies manage communication efficiently. Monitored by dedicated tools, the mesh ensures operational integrity and security in a complex banking ecosystem.

Mesh Gateways and Envoy Proxies

Mesh Gateways are crucial for inter-cluster communication, simplifying connectivity and network configurations. Envoy Proxies are strategically placed within the service mesh, managing the flow of traffic and enhancing the system’s ability to scale dynamically.

Security and User Interaction

The user’s journey begins with the authentication and authorization measures in place to verify and secure user access.

The Role of Consul

Consul’s service discovery capabilities are essential in allowing services like the Bank Interface and the Reconciliation Engine to discover each other and interact seamlessly, bypassing the limitations of static IP addresses.

Operational Efficiency

The service mesh’s contribution to operational efficiency is particularly evident in its integration with the Reconciliation Engine. This ensures that financial data requiring reconciliation is processed efficiently, securely, and directed towards the relevant services.

The Case for Service Mesh Integration

The shift to cloud-native architecture emphasizes the need for service meshes. This blueprint enhances agility and security, affirming the service mesh as pivotal for modern cloud networking.
