Author: Shanoj
Shanoj is a data engineer and solutions architect passionate about delivering business value and actionable insights through well-architected data products. He holds several certifications across AWS, Oracle, Apache, Google Cloud, Docker, and Linux, and focuses on data engineering and analysis using SQL, Python, big data technologies, RDBMS, and Apache Spark, among others.
He has 17+ years of experience working with various technologies in the Retail and BFS domains.
This hands-on article will help you create a simple deployment and a service, so you can manage and access applications efficiently within a Kubernetes cluster. You will create a deployment running four replicas of shanoj-testapp, along with a service that other pods in the cluster can use to reach it.
Creating the Deployment
A deployment ensures that a specified number of pod replicas are running at any given time. In this case, we will create a deployment for the shanoj-testapp service with four replicas.
Create the Deployment YAML File
Create a YAML file for the deployment using the cat command and the following content:
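The manifest itself is not reproduced here, but a minimal version matching the description (four replicas of shanoj-testapp) would look roughly like the following; the file name, container image, and container port are assumptions, while the replica count and labels follow the article:

cat <<EOF > shanoj-testapp-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: shanoj-testapp
  labels:
    app: shanoj-testapp
spec:
  replicas: 4
  selector:
    matchLabels:
      app: shanoj-testapp
  template:
    metadata:
      labels:
        app: shanoj-testapp
    spec:
      containers:
      - name: shanoj-testapp
        image: nginx:1.25   # assumption: substitute your application image
        ports:
        - containerPort: 80
EOF

Apply it with kubectl apply -f shanoj-testapp-deployment.yaml and confirm the four replicas with kubectl get deployment shanoj-testapp.
Creating the Service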
A service in Kubernetes defines a logical set of pods and a policy by which to access them. In this step, we will create a service to provide access to the shanoj-testapp pods.
Create the Service YAML File
Create a YAML file for the service using the cat command and the following content:
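The original listing is likewise not shown; a minimal Service manifest matching the description (name shanoj-svc, selector app=shanoj-testapp, traffic forwarded to port 80 on the pods) would be:

cat <<EOF > shanoj-svc.yaml
apiVersion: v1
kind: Service
metadata:
  name: shanoj-svc
spec:
  selector:
    app: shanoj-testapp
  ports:
  - protocol: TCP
    port: 80
    targetPort: 80
EOF

Apply it with kubectl apply -f shanoj-svc.yaml.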
This YAML file defines a service named shanoj-svc that selects pods with the label app=shanoj-testapp and forwards traffic to port 80 on the pods.
Verifying the Service
After creating the service, verify that it is running and accessible within the cluster.
Check the Service Status
Use the following command to check the status of the shanoj-svc service:
kubectl get svc shanoj-svc
Access the Service from a shanojtesting-pod Pod
Creating the shanojtesting-pod Pod
To test access to the service from inside the cluster, you need a shanojtesting-pod pod. Create a file named shanojtesting-pod.yaml with the following content:
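The pod definition is not reproduced here either; a simple test pod could look like this (the image and sleep command are assumptions — any image with curl or wget will do):

cat <<EOF > shanojtesting-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: shanojtesting-pod
spec:
  containers:
  - name: shanojtesting
    image: radial/busyboxplus:curl   # assumption: any curl-capable image works
    command: ["sleep", "3600"]
EOF

Create the pod and then call the service through its cluster DNS name:

kubectl apply -f shanojtesting-pod.yaml
kubectl exec shanojtesting-pod -- curl -s http://shanoj-svc

If the deployment and service are healthy, the curl command returns the application's response from one of the four shanoj-testapp replicas.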
This article explains how a Bank Reconciliation System is structured on AWS, with the aim of processing and reconciling banking transactions. The system automates the matching of transactions from batch feeds and provides a user interface for manually reconciling any open items.
Architecture Overview
The BRS (Bank Reconciliation System) is engineered to support high-volume transaction processing with an emphasis on automation, accuracy, and user engagement for manual interventions. The system incorporates AWS cloud services to ensure scalability, availability, and security.
Technical Flow
Batch Feed Ingestion: Transaction files, referred to as “left” and “right” feeds, are exported from an on-premises data center into the AWS environment.
Storage and Processing: Files are stored in an S3 bucket, triggering AWS Lambda functions.
Automated Reconciliation: Lambda functions process the batch feeds to perform automated matching of transactions. Matched transactions are termed “auto-match.”
Database Storage: Both the auto-matched transactions and the unmatched transactions, known as “open items,” are stored in an Amazon Aurora database.
Application Layer: A backend application, developed with Spring Boot, interacts with the database to retrieve and manage transaction data.
User Interface: An Angular front-end application presents the open items to application users (bank employees) for manual reconciliation.
System Components
AWS S3: Initial repository for batch feeds. Its event-driven capabilities trigger processing via Lambda.
AWS Lambda: The serverless compute layer that processes batch feeds and performs auto-reconciliation.
Amazon Aurora: A MySQL- and PostgreSQL-compatible relational database used to store both auto-matched and open transactions.
Spring Boot: Provides the backend services that facilitate the retrieval and management of transaction data for the front-end application.
Angular: The front-end framework used to build the user interface for the manual reconciliation process.
System Interaction
Ingestion: Batch feeds from the on-premises data center are uploaded to AWS S3.
Triggering Lambda: S3 events upon file upload automatically invoke Lambda functions dedicated to processing these feeds.
Processing: Lambda functions parse the batch feeds, automatically reconcile transactions where possible, and identify open items for manual reconciliation.
Storing Results: Lambda functions store the outcomes in the Aurora database, segregating auto-matched and open items (a minimal code sketch of this processing flow follows the list).
User Engagement: The Spring Boot application provides an API for the Angular front-end, through which bank employees access and work on open items.
Manual Reconciliation: Users perform manual reconciliations via the Angular application, which updates the status of transactions within the Aurora database accordingly.
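To make the Lambda portion of this flow concrete, here is a minimal, illustrative sketch in Python. It is not the production implementation: the feed naming convention, the matching key (reference plus amount), the transactions table, and the environment variables are all assumptions, and pymysql is used purely as an example of a MySQL-compatible client for Aurora.

import csv
import io
import os

import boto3
# pymysql is an illustrative choice of MySQL-compatible client for Aurora.
import pymysql

s3 = boto3.client("s3")


def lambda_handler(event, context):
    """Triggered by an S3 upload; reconciles a 'left' feed against its 'right' feed."""
    s3_info = event["Records"][0]["s3"]
    bucket = s3_info["bucket"]["name"]
    key = s3_info["object"]["key"]

    left = load_feed(bucket, key)
    # Assumed naming convention: the matching 'right' feed sits under a mirrored key.
    right = load_feed(bucket, key.replace("left", "right"))

    right_by_ref = {row["reference"]: row for row in right}
    auto_matched, open_items = [], []
    for row in left:
        match = right_by_ref.get(row["reference"])
        if match and row["amount"] == match["amount"]:
            auto_matched.append(row)   # "auto-match"
        else:
            open_items.append(row)     # "open item" for manual reconciliation

    persist(auto_matched, status="AUTO_MATCHED")
    persist(open_items, status="OPEN")
    return {"auto_matched": len(auto_matched), "open_items": len(open_items)}


def load_feed(bucket, key):
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
    return list(csv.DictReader(io.StringIO(body)))


def persist(rows, status):
    if not rows:
        return
    # Table name, column names, and environment variable names are assumptions.
    conn = pymysql.connect(
        host=os.environ["DB_HOST"],
        user=os.environ["DB_USER"],
        password=os.environ["DB_PASSWORD"],
        database=os.environ["DB_NAME"],
    )
    try:
        with conn.cursor() as cur:
            cur.executemany(
                "INSERT INTO transactions (reference, amount, status) VALUES (%s, %s, %s)",
                [(r["reference"], r["amount"], status) for r in rows],
            )
        conn.commit()
    finally:
        conn.close()

In a real deployment, credentials would come from AWS Secrets Manager and the matching rules would be richer, but the shape of the handler — parse the S3 event, load both feeds, auto-match, persist matched and open items — follows the flow described above.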
Security and Compliance
Data Encryption: All data in transit and at rest are encrypted using AWS security services.
Identity Management: Amazon Cognito ensures secure user authentication for application access.
Web Application Firewall: AWS WAF protects against common web threats and vulnerabilities.
Monitoring and Reliability
CloudWatch: Monitors the system, logs events, and raises alerts for anomalies.
High Availability: The system spans multiple Availability Zones for resilience and employs Elastic Load Balancing for traffic distribution.
Scalability
Elastic Beanstalk & EKS: Both services can scale the compute resources automatically in response to the load, ensuring that the BRS can handle peak volumes efficiently.
Note: When you deploy an application using Elastic Beanstalk, it automatically sets up an Elastic Load Balancer in front of the EC2 instances that are running your application. This is to distribute incoming traffic across those instances to balance the load and provide fault tolerance.
Cost Optimization
S3 Intelligent-Tiering: Manages storage costs by automatically moving less frequently accessed data to lower-cost tiers.
DevOps Practices
CodeCommit & ECR: Source code management and container image repository are handled via AWS CodeCommit and ECR, respectively, streamlining the CI/CD pipeline.
The BRS leverages AWS services to create a seamless, automated reconciliation process, complemented by an intuitive user interface for manual intervention, ensuring a robust solution for the bank’s reconciliation needs.
This article offers a detailed exploration of the design and implementation of bank reconciliation systems within an Online Transaction Processing (OLTP) environment and their integration with Online Analytical Processing (OLAP) systems for enhanced reporting. It navigates through the progression from transactional processing to analytical reporting, including schema designs and practical examples.
Dimensional Modeling is a data warehousing design approach that transforms complex databases into understandable schemas. It structures data using facts and dimensions, facilitating the creation of data cubes that enable sophisticated analytical queries for business intelligence and data analytics applications. This method ensures rapid data retrieval and aids in making informed decisions based on comprehensive data insights.
“Dimensional modeling is explicitly designed to address the need of users to understand the data easily and to analyze it rapidly.” — Ralph Kimball, “The Data Warehouse Toolkit”
OLTP System Schema
The Online Transaction Processing (OLTP) system is the backbone for capturing real-time banking transactions. It’s designed for high transactional volume, ensuring data integrity and quick response times.
Let’s use a fictitious use case to explain concepts and data modelling.
Core Tables Overview
FinancialInstitutions: Holds details about banks, like identifiers and addresses.
FinancialTransactions: Records each transaction needing reconciliation.
LifecycleAssociations: Tracks the transaction’s lifecycle stages.
AuditLogs: Logs all audit-related actions and changes.
Table Functions
FinancialInstitutions: Identifies banks involved in transactions.
FinancialTransactions: Central repository for transaction data.
LifecycleAssociations: Manages transaction progress through different stages.
AuditLogs: Ensures traceability and compliance through logging actions.
“In an OLTP system, the speed of transaction processing is often the key to business competitiveness.” — James Serra, in his blog on Data Warehousing
Transitioning to OLAP for Reporting
Transitioning to an Online Analytical Processing (OLAP) system involves denormalizing OLTP data to optimize for read-heavy analytical queries.
Star Schema Design for Enhanced Reporting
A star schema further refines the data structure for efficient querying, centring around FactTransactionAudit and connected to dimension tables:
DimBank
DimTransactionDetails
DimLifecycleStage
DimAuditTrail
Denormalized OLAP Schema Structure Explanation:
transaction_id: Serves as the primary key of the table, uniquely identifying each transaction in the dataset.
bank_id: Acts as a foreign key linking to the DimBank dimension table, which contains detailed information about each bank.
amount: Records the monetary value of the transaction.
transaction_date: Marks the date when the transaction occurred, useful for time-based analyses and reporting.
audit_id: A foreign key that references the DimAuditTrail dimension table, providing a link to audit information related to the transaction.
lifecycle_stage: Describes the current stage of the transaction within its lifecycle, such as “pending,” “processed,” or “reconciled,” which could be linked to the DimLifecycleStage dimension table for detailed descriptions of each stage.
is_auto_matched: A boolean or flag that indicates whether the transaction was matched automatically (AM), requiring no manual intervention. This is crucial for reporting and analyzing the efficiency of the reconciliation process.
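To make the structure concrete, the same table can be expressed as a Spark schema. This is only a sketch: the field types, precision, and nullability are assumptions, since the article does not specify them.

from pyspark.sql.types import (StructType, StructField, LongType,
                               DecimalType, DateType, StringType, BooleanType)

# One possible Spark schema for the denormalized fact table described above.
# Types and nullability are assumptions; adjust them to your data.
consolidated_schema = StructType([
    StructField("transaction_id", LongType(), nullable=False),
    StructField("bank_id", LongType(), nullable=False),           # FK -> DimBank
    StructField("amount", DecimalType(18, 2), nullable=False),
    StructField("transaction_date", DateType(), nullable=False),
    StructField("audit_id", LongType(), nullable=True),           # FK -> DimAuditTrail
    StructField("lifecycle_stage", StringType(), nullable=True),  # e.g. pending / processed / reconciled
    StructField("is_auto_matched", BooleanType(), nullable=False),
])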
“The objective of the data warehouse is to provide a coherent picture of the business at a point in time.” — Bill Inmon, “Building the Data Warehouse”
Reporting Use Cases: Auto Match Reporting
Identify transactions reconciled automatically. This requires joining the FactTransactionAudit table with dimension tables to filter auto-matched transactions.
Access Path for Auto Match Transactions
SELECT
    dt.transaction_id,
    db.name AS bank_name
    -- Additional fields
FROM FactTransactionAudit fta
JOIN DimTransactionDetails dt
    ON fta.FK_transaction_id = dt.transaction_id
-- Additional joins
WHERE fta.is_auto_matched = TRUE;
Partitioning Strategy
Partitioning helps manage large datasets by dividing them into more manageable parts based on certain fields, often improving query performance by allowing systems to read only relevant partitions.
Suggested Partitioning Scheme:
For the ConsolidatedTransactions table in an OLAP setting like Hive:
By Date: Partitioning by transaction date (e.g., transaction_date) is a natural choice for financial data, allowing efficient queries over specific periods.
Structure: /year=YYYY/month=MM/day=DD/.
By Bank: If analyses often filter by specific banks, adding a secondary level of partitioning by bank_id can further optimize access patterns.
Processing billions of transactions every month can result in generating too many small files if daily partitioning is used. However, if monthly partitioning is applied, it may lead to the creation of very large files that are inefficient to process. To strike a balance between the two, a weekly or bi-weekly partitioning scheme can be adopted. This approach can reduce the overall number of files generated and keep the file sizes manageable.
from pyspark.sql.functions import col, weekofyear

# Assuming df is your DataFrame loaded with the transaction data
df = df.withColumn("year_week", weekofyear(col("transaction_date")))
“In large databases, partitioning is critical for both performance and manageability.” — C.J. Date, “An Introduction to Database Systems”
Efficient Data Writes
Given the data volume, using repartition based on the partitioning scheme before writing can help distribute the data more evenly across the partitions, especially if the transactions are not uniformly distributed across the period.
Writing Data with Adaptive Repartitioning
Considering the volume, dynamic repartitioning based on the data characteristics of each write operation is necessary.
# Dynamically repartition based on the number of records.
# Aim for partitions with around 50 million to 100 million records each.
num_partitions = df.count() // 50000000
if num_partitions < 1:
    num_partitions = 1  # Ensure at least one partition
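The snippet above only computes a target partition count. A sketch of the corresponding repartition-and-write step might look like the following; the partition columns (year_week, bank_id), the write mode, and the table name mirror the examples elsewhere in this article but are otherwise assumptions.

# Partition columns follow the weekly scheme above; adjust to your layout.
df = df.repartition(num_partitions, "year_week", "bank_id")

(df.write
   .mode("append")   # append new periods; see the dynamic-overwrite note below for reloads
   .partitionBy("year_week", "bank_id")
   .format("parquet")
   .saveAsTable("ConsolidatedTransactions"))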
For incremental loads, enabling dynamic partitioning in Spark is crucial to ensure that only the relevant partitions are updated without scanning or rewriting the entire dataset.
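In Spark this typically comes down to one configuration switch plus a partition-aware insert. The sketch below assumes the ConsolidatedTransactions table already exists; insertInto is positional, so the DataFrame's columns must match the table's column order.

spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

# With dynamic mode, overwrite=True replaces only the partitions present in df,
# leaving every other partition of the table untouched.
df.write.insertInto("ConsolidatedTransactions", overwrite=True)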
Small files can degrade performance in Hadoop ecosystems by overwhelming the NameNode with metadata operations and causing excessive overhead during processing.
Solutions:
Compaction Jobs: Regularly run compaction jobs to merge small files into larger ones within each partition. Spark can be used to read in small files and coalesce them into a smaller number of larger files.
Spark Repartition/Coalesce: When writing data out from Spark, use repartition to increase the number of partitions (and thus files) if needed, or coalesce to reduce them, depending on your use case.
Writing Data with Coalesce to Avoid Small Files:
# Assuming df is your DataFrame loaded with the data ready to be written.
# Adjust the coalesce target (here, 10) based on your specific needs.
(df.coalesce(10)
   .write
   .mode("overwrite")
   .partitionBy("year", "month", "day", "bank_id")
   .format("parquet")
   .saveAsTable("ConsolidatedTransactions"))
Using Repartition for Better File Distribution
# Adjust the partition column names and number of partitions as needed
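# Illustrative sketch only: the column names and the partition count of 200 are assumptions.
(df.repartition(200, "year", "month", "day", "bank_id")
   .write
   .mode("overwrite")
   .partitionBy("year", "month", "day", "bank_id")
   .format("parquet")
   .saveAsTable("ConsolidatedTransactions"))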
Efficiently handling this data volume requires monitoring and tuning Spark’s execution parameters and memory.
Cluster Capacity: Ensure the underlying hardware and cluster resources are scaled appropriately to handle the data volume and processing needs.
Archival Strategy: Implement a data archival or purging strategy for older data that is no longer actively queried to manage overall storage requirements.
To manage billions of transactions every month, it’s necessary to plan and optimize data storage and processing strategies. You can significantly improve the performance and scalability of your big data system by utilizing partitioning schemes that balance the file size with the number of files and by dynamically adjusting to the data volume during writes.
Apache Spark provides two primary methods for performing aggregations: Sort-based aggregation and Hash-based aggregation. These methods are optimized for different scenarios and have distinct performance characteristics.
Hash-based Aggregation
Hash-based aggregation, as implemented by HashAggregateExec, is the preferred method for aggregation in Spark SQL when the conditions allow it. This method creates a hash table where each entry corresponds to a unique group key. As Spark processes rows, it quickly uses the group key to locate the corresponding entry in the hash table and updates the aggregate values accordingly. This method is generally faster because it avoids sorting the data before aggregation. However, it requires that all intermediate aggregate values fit into memory. If the dataset is too large or there are too many unique keys, Spark might be unable to use hash-based aggregation due to memory constraints. Key points about Hash-based aggregation include:
It is preferred when the aggregate functions and group by keys are supported by the hash aggregation strategy.
It can be significantly faster than sort-based aggregation because it avoids sorting data.
It uses off-heap memory for storing the aggregation map.
It may fall back to sort-based aggregation if the dataset is too large or has too many unique keys, leading to memory pressure.
Sort-based Aggregation
Sort-based aggregation, as implemented by SortAggregateExec, is used when hash-based aggregation is not feasible, either due to memory constraints or because the aggregation functions or group by keys are not supported by the hash aggregation strategy. This method involves sorting the data based on the group by keys and then processing the sorted data to compute aggregate values. While this method can handle larger datasets since it only requires some intermediate results to fit into memory, it is generally slower than hash-based aggregation due to the additional sorting step. Key points about Sort-based aggregation include:
It is used when hash-based aggregation is not feasible due to memory constraints or unsupported aggregation functions or group by keys.
It involves sorting the data based on the group by keys before performing the aggregation.
It can handle larger datasets since it streams data through disk and memory.
Detailed Explanation of Hash-based Aggregation
Hash-based Aggregation in Apache Spark operates through the HashAggregateExec physical operator. This process is optimized for aggregations where the dataset can fit into memory, and it leverages mutable types for efficient in-place updates of aggregation states.
Initialization: When a query that requires aggregation is executed, Spark determines whether it can use hash-based aggregation. This decision is based on factors such as the types of aggregation functions (e.g., sum, avg, min, max, count), the data types of the columns involved, and whether the dataset is expected to fit into memory.
Partial Aggregation (Map Side): The aggregation process begins with a “map-side” partial aggregation. For each partition of the input data, Spark creates an in-memory hash map where each entry corresponds to a unique group key. As rows are processed, Spark updates the aggregation buffer for each group key directly in the hash map. This step produces partial aggregate results for each partition.
Shuffling: After the partial aggregation, Spark shuffles the data by the grouping keys, so that all records belonging to the same group are moved to the same partition. This step is necessary to ensure that the final aggregation produces accurate results across the entire dataset.
Final Aggregation (Reduce Side): Once the shuffled data is partitioned, Spark performs the final aggregation. It again uses a hash map to aggregate the partially aggregated results. This step combines the partial results from different partitions to produce the final aggregate value for each group.
Spill to Disk: If the dataset is too large to fit into memory, Spark’s hash-based aggregation can spill data to disk. This mechanism ensures that Spark can handle datasets larger than the available memory by using external storage.
Fallback to Sort-based Aggregation: In cases where the hash map becomes too large or if there are memory issues, Spark can fall back to sort-based aggregation. This decision is made dynamically based on runtime conditions and memory availability.
Output: The final output of the HashAggregateExec operator is a new dataset where each row represents a group along with its aggregated value(s).
The efficiency of hash-based aggregation comes from its ability to perform in-place updates to the aggregation buffer and its avoidance of sorting the data. However, its effectiveness is limited by the available memory and the nature of the dataset. For datasets that do not fit well into memory or when dealing with complex aggregation functions that are not supported by hash-based aggregation, Spark might opt for sort-based aggregation instead.
Detailed Explanation of Sort-based Aggregation
Sort-based Aggregation in Apache Spark works through a series of steps that involve shuffling, sorting, and then aggregating the data.
Shuffling: The data is partitioned across the cluster based on the grouping keys. This step ensures that all records with the same key end up in the same partition.
Sorting: Within each partition, the data is sorted by the grouping keys. This is necessary because the aggregation will be performed on groups of data with the same key, and having the data sorted ensures that all records for a given key are contiguous.
Aggregation: Once the data is sorted, Spark can perform the aggregation. For each partition, Spark uses a SortBasedAggregationIterator to iterate over the sorted records. This iterator maintains a buffer row to cache the aggregated values for the current group.
Processing Rows: As the iterator goes through the rows, it processes them one by one, updating the buffer with the aggregate values. When the end of a group is reached (i.e., the next row has a different grouping key), the iterator outputs a row with the final aggregate value for that group and resets the buffer for the next group.
Memory Management: Unlike hash-based aggregation, which requires a hash map to hold all group keys and their corresponding aggregate values, sort-based aggregation only needs to maintain the aggregate buffer for the current group. This means that sort-based aggregation can handle larger datasets that might not fit entirely in memory.
Fallback Mechanism: Although not part of the normal operation, it’s worth noting that Spark’s HashAggregateExec can theoretically fall back to sort-based aggregation if it encounters memory issues during hash-based processing.
The sort-based aggregation process is less efficient than hash-based aggregation because it involves the extra step of sorting the data, which is computationally expensive. However, it is more scalable for large datasets or when dealing with immutable types in the aggregation columns that prevent the use of hash-based aggregation.
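A quick way to see which of the two strategies Spark has chosen is to inspect the physical plan with explain(). The sketch below is illustrative only — the data and column names are made up, and the exact plan text varies by Spark version: summing a numeric column normally shows up as HashAggregate, while max() over a string column uses an immutable string buffer and therefore typically appears as SortAggregate.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("agg-plans").getOrCreate()

df = spark.createDataFrame(
    [("acct-1", "alice", 10.0), ("acct-1", "bob", 5.0), ("acct-2", "carol", 7.5)],
    ["account", "owner", "amount"],
)

# Numeric aggregation buffers are mutable -> usually planned as HashAggregate.
df.groupBy("account").agg(F.sum("amount").alias("total")).explain()

# A max() over a string column needs an immutable (string) buffer,
# so Spark usually falls back to SortAggregate for this plan.
df.groupBy("account").agg(F.max("owner").alias("last_owner")).explain()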
Back in the 2020s, a project requirement came my way that was both challenging and interesting. My manager and I were both new to the company, so the success of this project was crucial for us. Although the technical stack was relatively simple, the project holds a special place in my heart for many reasons. In this article, I aim to share my experience and journey with it. This is not a how-to piece; rather, I want to talk about mindset and culture, both at a personal level and as a team.
“The only way to do great work is to love what you do.” — Steve Jobs
The challenges were manifold: there was no budget or only a minimal budget available, a shortage of technical team members, and a need to deliver in a very short timeframe. Sounds interesting, right? I was thrilled by the requirement.
“Start where you are. Use what you have. Do what you can.” — Arthur Ashe
The high-level project details are as follows: from an upstream, Oracle-based application, we needed to transfer data downstream in JSON format. The backend-to-frontend relationship of the upstream application is one-to-many, meaning the backend Oracle data has multiple frontend views that represent it according to business use or objects. The data pipeline therefore needed the capacity to transform the Oracle data into many front-end views, which meant mapping the data against the XML configuration used by the front end. The daily transaction volume was not huge, but still a decent 3 million records.
Oversimplified/high-level
“It always seems impossible until it’s done.” — Nelson Mandela
Since we had no budget or resource strength, I decided to explore open-source and no-code or low-code solutions. After a day or two of research, I decided to use Apache NiFi and MongoDB. The idea was to reverse-engineer the current upstream application using NiFi, which means NiFi would read data from Oracle and simultaneously map those data according to the frontend configuration stored in XML for the frontend application.
“Teamwork is the ability to work together toward a common vision. The ability to direct individual accomplishments toward organizational objectives. It is the fuel that allows common people to attain uncommon results.” — Andrew Carnegie
When I first presented this solution to the team and management, the initial response was not very positive. Except for my manager, most stakeholders expressed doubts, which is understandable since most of them were hearing about Apache NiFi for the first time, and their experience lay primarily in Oracle and Java. MongoDB was also new to them. Normally, people think about “how” first, but I was focused on “why” and “what” before “how.”
After several rounds of discussion, including visual flowcharts and small prototypes, everyone agreed on the tech stack and the solution.
This led to the next challenge: the team we assembled had never worked with these tech stacks. To move forward, we needed to build a production-grade platform. I started by reading books and devised infrastructure build steps for both NiFi and MongoDB clusters. It was challenging, and there were extensive troubleshooting sessions, but we finally built our platform.
One book that helped enormously was “Data Engineering with Python: Work with massive datasets to design data models and automate data pipelines using Python” by Paul Crickard. Even though we used Groovy for scripting and not Python, it helped us understand Apache NiFi in greater detail. I remember those days; I used to carry this book with me everywhere I went, even to the park and while waiting at traffic signals on the road. Thanks to the author, Paul Crickard.
“Don’t watch the clock; do what it does. Keep going.” — Sam Levenson
The team was very small, comprising two Java developers and two Linux developers, all new to these tech stacks. Once the platform was ready, my next task was to train our developers to bring them up to speed. Fortunately, I found a book that helped me with Apache NiFi, and for MongoDB, I took a MongoDB University course. I filtered the required information for my team to learn. There were also challenges, as my team was in China and I was in the North American timezone, requiring me to work mostly at night to help my team.
Another challenge that later became an advantage was communication. My Chinese team was not very comfortable with verbal explanations, so I decided to share knowledge through visual aids like flowcharts and diagrams, along with developing small prototypes for them to try and walk through. This approach helped a lot later on, as I learned how to represent complex topics in simple diagrams, a skill I use in my articles today.
“What you get by achieving your goals is not as important as what you become by achieving your goals.” — Zig Ziglar
The lucky factor was that my team was extremely hardworking and quick learners. They adapted to the new tech stack in minimal time. In three months, we built the first end-to-end MVP and presented a full demo to the business, which was a huge success. Three months later, we scaled the application to production-grade and delivered our first release. Today, this team is the fastest in the release cycle and operates independently.
“It is not mandatory for a Solution Architect to be an SME in any specific tool or technology. Instead, he or she should be a strategic thinker, putting “why” and “what” first and adapting quickly to take advantage of opportunities where we can add value to the business or end user despite all limitations and trade-offs.” — Shanoj Kumar V
“Success is not the key to happiness. Happiness is the key to success. If you love what you are doing, you will be successful.” — Albert Schweitzer
Memory spill in Apache Spark is the process of transferring data from RAM to disk, and potentially back again. This happens when the dataset exceeds the available memory capacity of an executor during tasks that require more memory than is available. In such cases, data is spilled to disk to free up RAM and prevent out-of-memory errors. However, this process can slow down processing due to the slower speed of disk I/O compared to memory access.
Dynamic Occupancy Mechanism
Apache Spark employs a dynamic occupancy mechanism for managing Execution and Storage memory pools. This mechanism enhances the flexibility of memory usage by allowing Execution Memory and Storage Memory to borrow from each other, depending on workload demands:
Execution Memory: Primarily used for computation tasks such as shuffles, joins, and sorts. When execution tasks demand more memory, they can borrow from the Storage Memory if it is underutilized.
Storage Memory: Used for caching and persisting RDDs, DataFrames, and Datasets. If the demand for storage memory exceeds its allocation, and Execution Memory is not fully utilized, Storage Memory can expand into the space allocated for Execution Memory.
Spark’s internal memory manager controls this dynamic sharing and is crucial for optimizing the utilization of available memory resources, significantly reducing the likelihood of memory spills.
Common Performance Issues Related to Spills
Spill(disk) and Spill(memory): When data doesn’t fit in RAM, it is temporarily written to disk. This operation, while enabling Spark to handle larger datasets, impacts computation time and efficiency because disk access is slower than memory access.
Impact on Performance: Spills to disk can negatively affect performance, increasing both the cost and operational complexity of Spark applications. The strength of Spark lies in its in-memory computing capabilities; thus, disk spills are counterproductive to its design philosophy.
Solutions for Memory Spill in Apache Spark
Mitigating memory spill issues involves several strategies aimed at optimizing memory use, partitioning data more effectively, and improving overall application performance.
Optimizing Memory Configuration
Adjust memory allocation settings to provide sufficient memory for both execution and storage, potentially increasing the memory per executor.
Tune the ratio between execution and storage memory based on the specific requirements of your workload.
Partitioning Data
Optimize data partitioning to ensure even data distribution across partitions, which helps in avoiding memory overloads in individual partitions.
Consider different partitioning strategies such as range, hash, or custom partitioning based on the nature of your data.
Caching and Persistence
Use caching and persistence methods (e.g., cache() or persist()) to store intermediate results or frequently accessed data in memory, reducing the need for recomputation.
Select the appropriate storage level for caching to balance between memory usage and CPU efficiency, as in the brief sketch below.
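A minimal sketch of this, assuming df and dim_df are DataFrames already loaded in your job:

from pyspark import StorageLevel

# df and dim_df are assumed inputs; the join key is illustrative.
enriched_df = df.join(dim_df, "bank_id")

# Persist an intermediate result in memory, spilling to disk if memory runs short.
enriched_df.persist(StorageLevel.MEMORY_AND_DISK)
enriched_df.count()   # materialize the cache once before it is reused downstream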
Monitoring and Tuning
Monitor memory usage and spills using Spark UI or other monitoring tools to identify and address bottlenecks.
Adjust configurations dynamically based on performance metrics and workload patterns.
Data Compression
Employ data compression techniques and columnar storage formats (e.g., Parquet, ORC) to reduce the memory footprint.
Persist RDDs in serialized form, using storage levels such as MEMORY_ONLY_SER, to minimize memory usage.
Avoiding Heavy Shuffles
Optimize join operations and minimize unnecessary data movement by using strategies such as broadcasting smaller tables or implementing partition pruning.
Reduce shuffle operations which can lead to spills by avoiding wide dependencies and optimizing shuffle operations.
Formulaic Approach to Avoid Memory Spills
Apache Spark’s memory management model is designed to balance between execution memory (used for computation like shuffles, joins, sorts) and storage memory (used for caching and persisting data). Understanding and optimizing the use of these memory segments can significantly reduce the likelihood of memory spills.
Memory Configuration Parameters:
Total Executor Memory (spark.executor.memory): The total memory allocated per executor.
Memory Overhead (spark.executor.memoryOverhead): Additional memory allocated to each executor, beyond spark.executor.memory, for Spark to execute smoothly.
Spark Memory Fraction (spark.memory.fraction): Specifies the proportion of the executor memory dedicated to Spark’s memory management system (default is 0.6 or 60%).
Determine Execution and Storage Memory: Spark splits the available memory between execution and storage. The division is dynamic, but under memory pressure, storage can shrink to as low as the value defined by spark.memory.storageFraction (default is 0.5 or 50% of Spark memory).
Example Calculation:
Suppose an executor is configured with 10GB (spark.executor.memory = 10GB) and the default overhead (10% of executor memory or at least 384MB). Let’s assume an overhead of 1GB for simplicity and the default memory fractions.
Total Executor Memory: 10GB
Memory Overhead: 1GB
Spark Memory Fraction: 0.6 (60%)
Available Memory for Spark = (10 GB − 1 GB) × 0.6 = 5.4 GB
Assuming spark.memory.storageFraction is set to 0.5, both execution and storage memory pools could use up to 2.7GB each under balanced conditions.
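Expressed as configuration, the same example looks roughly like this. Note that executor memory and overhead are launch-time settings — they must be supplied at submit time or when the session is first created rather than changed mid-job; the values below simply mirror the worked example above.

from pyspark.sql import SparkSession

# Values mirror the example calculation above; tune them for your workload.
spark = (
    SparkSession.builder
    .appName("memory-tuning-example")
    .config("spark.executor.memory", "10g")
    .config("spark.executor.memoryOverhead", "1g")
    .config("spark.memory.fraction", "0.6")
    .config("spark.memory.storageFraction", "0.5")
    .getOrCreate()
)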
Strategies to Avoid Memory Spills:
Increase Memory Allocation: If possible, increasing spark.executor.memory ensures more memory is available for Spark processes.
Adjust Memory Fractions: Tweaking spark.memory.fraction and spark.memory.storageFraction can help allocate memory more efficiently based on the workload. For compute-intensive operations, you might allocate more memory for execution.
Real-life Use Case: E-commerce Sales Analysis
An e-commerce platform experienced frequent memory spills while processing extensive sales data during holiday seasons, leading to performance bottlenecks.
Problem:
Large-scale aggregations and joins were causing spills to disk, slowing down the analysis of sales data, impacting the ability to generate real-time insights for inventory and pricing adjustments.
Solution:
Memory Optimization: The data team increased the executor memory from 8GB to 16GB per executor and adjusted the spark.memory.fraction to 0.8 to dedicate more memory to Spark’s managed memory system.
Partitioning and Data Skew Management: They implemented custom partitioning strategies to distribute the data more evenly across nodes, reducing the likelihood of individual tasks running out of memory.
Caching Strategy: Important datasets used repeatedly across different stages of the analysis were persisted in memory, and the team carefully chose the storage levels to balance between memory usage and CPU efficiency.
Monitoring and Tuning: Continuous monitoring of the Spark UI and logs allowed the team to identify memory-intensive operations and adjust configurations dynamically. They also fine-tuned spark.memory.storageFraction to better balance between execution and storage memory, based on the nature of their tasks.
These strategies significantly reduced the occurrence of memory spills, improved the processing speed of sales data analysis, and enabled the e-commerce platform to adjust inventory and pricing strategies in near real-time during peak sales periods.
This example demonstrates the importance of a holistic approach to Spark memory management, including proper configuration, efficient data partitioning, and strategic use of caching, to mitigate memory spill issues and enhance application performance.
In exploring the integration of Amazon Redshift and Redshift Spectrum for data warehousing and data lake architectures, it’s essential to consider a scenario where a data engineer sets up a daily data loading pipeline into a data warehouse.
This setup is geared towards optimizing the warehouse for the majority of reporting queries, which typically focus on the latest 12 months of data. To maintain efficiency and manage storage, the engineer might also implement a process to remove data older than 12 months. However, this strategy raises a question: how to handle the 20% of queries that require historical data beyond this period?
Amazon Redshift is a powerful, scalable data warehouse service that simplifies the process of analyzing large volumes of data with high speed and efficiency. It allows for complex queries over vast datasets, providing the backbone for modern data analytics. Redshift’s architecture is designed to handle high query loads and vast amounts of data, making it an ideal solution for businesses seeking to leverage their data for insights and decision-making. Its columnar storage and data compression capabilities ensure that data is stored efficiently, reducing the cost and increasing the performance of data operations.
Redshift Spectrum extends the capabilities of Amazon Redshift by allowing users to query and analyze data stored in Amazon S3 directly from within Redshift, without the need for loading or transferring the data into the data warehouse. This feature is significant because it enables users to access both recent and historical data seamlessly, bridging the gap between the data stored in Redshift and the extensive, unstructured data residing in a data lake. Spectrum offers the flexibility to query vast amounts of data across a data lake, providing the ability to run complex analyses on data that is not stored within the Redshift cluster itself.
Here, Redshift Spectrum plays a crucial role. It’s a feature that extends the capabilities of the Amazon Redshift data warehouse, allowing it to query data stored externally in a data lake. This functionality is significant because it enables users to access both recent and historical data without the need to store all of it directly within the data warehouse.
The process starts with the AWS Glue Data Catalog, which acts as a central repository for all the databases and tables in the data lake. By setting up Amazon Redshift to work with the AWS Glue Data Catalog, users can seamlessly query tables both inside Redshift and those cataloged in the AWS Glue. This setup is particularly advantageous for comprehensive data analysis, bridging the gap between the structured environment of the data warehouse and the more extensive, unstructured realm of the data lake.
AWS Glue Data Catalog and Apache Hive Metastore are both metadata repositories for managing data structures in data lakes and warehouses. AWS Glue Data Catalog, a cloud-native service, integrates seamlessly with AWS analytics services, offering automatic schema discovery and a fully managed experience. In contrast, Hive Metastore requires more manual setup and maintenance and is primarily used in on-premises or hybrid cloud environments. AWS Glue Data Catalog is easier to use, automated, and tightly integrated within the AWS ecosystem, making it the preferred choice for users invested in AWS services.
AWS Glue is a managed and serverless service that assists in data preparation for analytics. It automates the ETL (Extract, Transform, Load) process and provides two primary components for data transformation: the Glue Python Shell for smaller datasets and Apache Spark for larger datasets. Both of these components can interact with data in Amazon S3, the AWS Glue Data Catalog, and various databases or data integration services. AWS Glue simplifies ETL tasks by managing the computing resources required, which are measured in data processing units (DPUs).
Key Takeaway: AWS Glue eliminates the need for server management and is highly scalable, making it an ideal choice for businesses looking to streamline their data transformation and loading processes without deep infrastructure knowledge.
AWS Glue Data Catalog
The AWS Glue Data Catalog acts as a central repository for metadata storage, akin to a Hive metastore, facilitating the management of ETL jobs. It integrates seamlessly with other AWS services like Athena and Amazon EMR, allowing for efficient data queries and analytics. Glue Crawlers automatically discover and catalog data across services, simplifying the process of ETL job design and execution.
Key Takeaway: Utilizing the AWS Glue Data Catalog can significantly reduce the time and effort required to prepare data for analytics, providing an automated, organized approach to data management and integration.
Amazon EMR Overview
Amazon EMR is a cloud big data platform for processing massive amounts of data using open-source tools such as Apache Spark, HBase, Presto, and Hadoop. Unlike AWS Glue’s serverless approach, EMR requires the manual setup of clusters, offering a more customizable environment. EMR supports a broader range of big data tools and frameworks, making it suitable for complex analytical workloads that benefit from specific configurations and optimizations.
Key Takeaway: Amazon EMR is best suited for users with specific requirements for their big data processing tasks that necessitate fine-tuned control over their computing environments, as well as those looking to leverage a broader ecosystem of big data tools.
Glue Workflows for Orchestrating Components
AWS Glue Workflows provides a managed orchestration service for automating the sequencing of ETL jobs. This feature allows users to design complex data processing pipelines triggered by schedule, event, or job completion, ensuring a seamless flow of data transformation and loading tasks.
Key Takeaway: By leveraging AWS Glue Workflows, businesses can efficiently automate their data processing tasks, reducing manual oversight and speeding up the delivery of analytics-ready data.
The critical competencies of an architect are the foundation of their profession. They include a Strategic Mindset, Technical Acumen, Domain Knowledge, and Leadership capabilities. These competencies are not just buzzwords; they are essential attributes that define an architect’s ability to navigate and shape the built environment effectively.
Growth Path
The growth journey of an architect involves evolving expertise, which begins with a technical foundation and gradually expands into domain-specific knowledge before culminating in strategic leadership. This journey progresses through various stages, starting from the role of a Technical Architect, advancing through Solution and Domain Architect, and evolving into a Business Architect. The journey then peaks with the positions of Enterprise Architect and Chief Enterprise Architect. Each stage in this progression requires a deeper understanding and broader vision, reflecting the multifaceted nature of architectural practice.
Qualities of a Software Architect
Visual Thinking: Crucial for software architects, this involves the ability to conceptualize and visualize complex software systems and frameworks. It’s essential for effective communication and the realization of software architectural visions. By considering factors like system scalability, interoperability, and user experience, software architects craft visions that guide development teams and stakeholders, ensuring successful project outcomes.
Foundation in Software Engineering: A robust foundation in software engineering principles is vital for designing and implementing effective software solutions. This includes understanding software development life cycles, agile methodologies, and continuous integration/continuous deployment (CI/CD) practices, enabling software architects to build efficient, scalable, and maintainable systems.
Modelling Techniques: Mastery of software modelling techniques, such as Unified Modeling Language (UML) diagrams, entity-relationship diagrams (ERD), and domain-driven design (DDD), allows software architects to efficiently structure and communicate complex systems. These techniques facilitate the clear documentation and understanding of software architecture, promoting better team alignment and project execution.
Infrastructure and Cloud Proficiency: Modern infrastructure, including cloud services (AWS, Azure, Google Cloud), containerization technologies (Docker, Kubernetes), and serverless architectures, is essential. This knowledge enables software architects to design systems that are scalable, resilient, and cost-effective, leveraging the latest in cloud computing and DevOps practices.
Security Domain Expertise: A deep understanding of cybersecurity principles, including secure coding practices, encryption, authentication protocols, and compliance standards (e.g., GDPR, HIPAA), is critical. Software architects must ensure the security and privacy of the applications they design, protecting them from vulnerabilities and threats.
Data Management and Analytics: Expertise in data architecture, including relational databases (RDBMS), NoSQL databases, data warehousing, big data technologies, and data streaming platforms, is crucial. Software architects need to design data strategies that support scalability, performance, and real-time analytics, ensuring that data is accessible, secure, and leveraged effectively for decision-making.
Leadership and Vision: Beyond technical expertise, the ability to lead and inspire development teams is paramount. Software architects must possess strong leadership qualities, fostering a culture of innovation, collaboration, and continuous improvement. They play a key role in mentoring developers, guiding architectural decisions, and aligning technology strategies with business objectives.
Critical and Strategic Thinking: Indispensable for navigating the complexities of software development, these skills enable software architects to address technical challenges, evaluate trade-offs, and make informed decisions that balance immediate needs with long-term goals.
Adaptive and Big Thinking: The ability to adapt to rapidly changing technology landscapes and think broadly about solutions is essential. Software architects must maintain a holistic view of their projects, considering not only the technical aspects but also market trends, customer needs, and business strategy. This broad perspective allows them to identify innovative opportunities and drive technological advancement within their organizations.
As software architects advance through their careers, from Technical Architect to Chief Enterprise Architect, they cultivate these essential qualities and competencies. This professional growth enhances their ability to impact projects and organizations significantly, leading teams to deliver innovative, robust, and scalable software solutions.
A Data Lake is a centralized location designed to store, process, and protect large amounts of data from various sources in its original format. It is built to manage the scale, versatility, and complexity of big data, which includes structured, semi-structured, and unstructured data. It provides extensive data storage, efficient data management, and advanced analytical processing across different data types. The logical architecture of a Data Lake typically consists of several layers, each with a distinct purpose in the data lifecycle, from data intake to utilization.
Data Delivery Type and Production Cadence
Data within the Data Lake can be delivered in multiple forms, including table rows, data streams, and discrete data files. It supports various production cadences, catering to batch processing and real-time streaming, to meet different operational and analytical needs.
Landing / Raw Zone
The Landing or Raw Zone is the initial repository for all incoming data, where it is stored in its original, unprocessed form. This area serves as the data’s entry point, maintaining its integrity and ensuring traceability by keeping the data immutable.
Clean/Transform Zone
Following the landing zone, data is moved to the Clean/Transform Zone, where it undergoes cleaning, normalization, and transformation. This step prepares the data for analysis by standardizing its format and structure, enhancing data quality and usability.
Cataloguing & Search Layer
The Ingestion Layer manages data entry into the Data Lake, capturing essential metadata and categorizing data appropriately. It supports various data ingestion methods, including batch and real-time streams, facilitating efficient data discovery and management.
Data Structure
The Data Lake accommodates a wide range of data structures, from structured, such as databases and CSV files, to semi-structured, like JSON and XML, and unstructured data, including text documents and multimedia files.
Processing Layer
The Processing Layer is at the heart of the Data Lake, equipped with powerful tools and engines for data manipulation, transformation, and analysis. It facilitates complex data processing tasks, enabling advanced analytics and data science projects.
Curated/Enriched Zone
Data that has been cleaned and transformed is further refined in the Curated/Enriched Zone. It is enriched with additional context or combined with other data sources, making it highly valuable for analytical and business intelligence purposes. This zone hosts data ready for consumption by end-users and applications.
Consumption Layer
Finally, the Consumption Layer provides mechanisms for end-users to access and utilize the data. Through various tools and applications, including business intelligence platforms, data visualization tools, and APIs, users can extract insights and drive decision-making processes based on the data stored in the Data Lake.
AWS Data Lakehouse Architecture
Oversimplified/high-level
An AWS Data Lakehouse is a powerful combination of data lakes and data warehouses, which utilizes Amazon Web Services to establish a centralized data storage solution. This solution caters to both raw data in its primitive form and the precision required for intricate analysis. By breaking down data silos, a Data Lakehouse strengthens data governance and security while simplifying advanced analytics. It offers businesses an opportunity to uncover new insights while preserving the flexibility of data management and analytical capabilities.
Kinesis Firehose
Amazon Kinesis Firehose is a fully managed service provided by Amazon Web Services (AWS) that enables you to easily capture and load streaming data into data stores and analytics tools. With Kinesis Firehose, you can ingest, transform, and deliver data in real time to various destinations such as Amazon S3, Amazon Redshift, and Amazon Elasticsearch Service. The service is designed to scale automatically to handle any amount of streaming data and requires no administration. Kinesis Firehose supports data formats such as JSON, CSV, and Apache Parquet, among others, and provides built-in data transformation capabilities to prepare data for analysis. With Kinesis Firehose, you can focus on your data processing logic and leave the data delivery infrastructure to AWS.
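As a small illustration of the ingestion side, the snippet below pushes a single JSON record to a delivery stream with boto3; the stream name and record fields are assumptions for this example.

import json
import boto3

firehose = boto3.client("firehose")

record = {"transaction_id": 12345, "amount": 250.75, "currency": "USD"}  # illustrative payload
firehose.put_record(
    DeliveryStreamName="example-delivery-stream",   # assumed stream name
    Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
)

Firehose buffers records like this and delivers them to the configured destination, such as an S3 prefix, with no servers to manage.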
Amazon CloudWatch
Amazon CloudWatch is a monitoring service that helps you keep track of your operational metrics and logs and sends alerts to optimize performance. It enables you to monitor and collect data on various resources like EC2 instances, RDS databases, and Lambda functions, in real-time. With CloudWatch, you can gain insights into your application’s performance and troubleshoot issues quickly.
Amazon S3 for State Backend
The Amazon S3 state backend serves as the backbone of the Data Lakehouse. It acts as a repository for the state of streaming data, preserving it durably.
Amazon Kinesis Data Analytics
Amazon Kinesis Data Analytics uses SQL and Apache Flink to provide real-time analytics on streaming data with precision.
Amazon S3
Amazon S3 is a secure, scalable, and resilient storage for the Data Lakehouse’s data.
AWS Glue Data Catalog
The AWS Glue Data Catalog is a fully managed metadata repository that enables easy data discovery, organization, and management for streamlined analytics and processing in the Data Lakehouse. It provides a unified view of all data assets, including databases, tables, and partitions, making it easier for data engineers, analysts, and scientists to find and use the data they need. The AWS Glue Data Catalog also supports automatic schema discovery and inference, making it easier to maintain accurate and up-to-date metadata for all data assets. With the AWS Glue Data Catalog, organizations can improve data governance and compliance, reduce data silos, and accelerate time-to-insight.
Amazon Athena
Amazon Athena enables users to query data in Amazon S3 using standard SQL without ETL complexities, thanks to its serverless and interactive architecture.
Amazon Redshift
Amazon Redshift is a highly efficient and scalable data warehouse service that streamlines the process of data analysis. It is designed to enable users to query vast amounts of structured and semi-structured data stored across their data warehouse, operational database, and data lake using standard SQL. With Amazon Redshift, users can gain valuable insights and make data-driven decisions quickly and easily. Additionally, Amazon Redshift is fully managed, allowing users to focus on their data analysis efforts rather than worrying about infrastructure management. Its flexible pricing model, based on usage, makes it a cost-effective solution for businesses of all sizes.
Consumption Layer
The Consumption Layer includes business intelligence tools and applications like Amazon QuickSight. This layer allows end-users to visualize, analyze, and interpret the processed data to derive actionable business insights.