In a distributed capital markets system, client requests such as trade executions, settlement confirmations, and clearing operations must be replicated across multiple nodes for consistency and fault tolerance. Finalizing any operation requires confirmation from all nodes, or at least from a majority quorum.
For instance, when a trade is executed, multiple nodes (such as trading engines, clearing systems, and settlement systems) must confirm the transaction’s key details — such as the trade price, volume, and counterparty information. This multi-node confirmation ensures that the trade is valid and can proceed. These confirmations are critical to maintaining consistency across the distributed system, helping to prevent discrepancies that could result in financial risks, errors, or regulatory non-compliance.
However, asynchronous communication between nodes means responses arrive at different times, complicating the process of tracking requests and waiting for a quorum before proceeding with the operation.
The following diagram illustrates the problem scenario, where client trade requests are propagated asynchronously across multiple nodes. Each node processes the request at different times, and the system must wait for a majority quorum of responses to proceed with the trade confirmation.
Solution: Request Waiting List Distributed Design Pattern
The Request Waiting List pattern addresses the challenge of tracking client requests that require responses from multiple nodes asynchronously. It maintains a list of pending requests, with each request linked to a specific condition that must be met before the system responds to the client.
Definition:
The Request Waiting List pattern is a distributed system design pattern used to track client requests that require responses from multiple nodes asynchronously. It maintains a queue of pending requests, each associated with a condition or callback that triggers when the required criteria (such as a majority quorum of responses or a specific confirmation) are met. This ensures that operations are finalized consistently and efficiently, even in systems with asynchronous communication between nodes.
Application in Capital Markets Use Case:
The flowchart illustrates the process of handling a trade request in a distributed capital markets system using the Request Waiting List pattern. The system tracks each request and waits for a majority quorum of responses before fulfilling the client’s trade request, ensuring consistency across nodes.
1. Asynchronous Communication in Trade Execution:
When a client initiates a trade, the system replicates the trade details across multiple nodes (e.g., order matching systems, clearing systems, settlement systems).
These nodes communicate asynchronously, meaning that responses confirming trade details or executing the order may arrive at different times.
2. Tracking with a Request Waiting List:
The system keeps a waiting list that maps each trade request to a unique identifier, such as a trade ID or correlation ID.
Each request has an associated callback function that checks whether the necessary conditions have been fulfilled to proceed. For example, the callback may be triggered when confirmation is received from a specific node (such as the clearing system) or when a majority of nodes have confirmed the trade (see the sketch after this list).
3. Majority Quorum for Trade Confirmation:
The system waits until a majority quorum of confirmations is received (e.g., more than half of the relevant nodes) before proceeding with the next step, such as executing the trade or notifying the client of the outcome.
4. Callback and High-Water Mark in Settlement:
In the case of settlement, the High-Water Mark represents the total number of confirmations required from settlement nodes before marking the transaction as complete.
Once the necessary conditions — such as price match and volume confirmation — are satisfied, the callback is invoked. The system then informs the client that the trade has been successfully executed or settled.
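The following Python sketch shows how such a waiting list might be structured, with each pending request mapped by its correlation ID to a callback that fires once a quorum of confirmations arrives. The class and method names are illustrative assumptions, not taken from any specific framework:

```python
import threading
from typing import Callable, Dict, List

class RequestWaitingList:
    """Tracks pending requests by correlation ID and invokes a callback
    once a quorum of node responses has been collected."""

    def __init__(self, quorum: int):
        self.quorum = quorum
        self.lock = threading.Lock()
        self.responses: Dict[str, List] = {}
        self.callbacks: Dict[str, Callable] = {}

    def add(self, correlation_id: str, on_quorum: Callable) -> None:
        with self.lock:
            self.responses[correlation_id] = []
            self.callbacks[correlation_id] = on_quorum

    def handle_response(self, correlation_id: str, node_id: str, payload: dict) -> None:
        with self.lock:
            pending = self.responses.get(correlation_id)
            if pending is None:
                return  # request already completed or unknown
            pending.append((node_id, payload))
            if len(pending) < self.quorum:
                return  # still waiting for more confirmations
            callback = self.callbacks.pop(correlation_id)
            confirmations = self.responses.pop(correlation_id)
        # Quorum reached: respond to the client outside the lock
        callback(confirmations)

# Usage: wait for 2 of 3 nodes to confirm a trade
waiting_list = RequestWaitingList(quorum=2)
waiting_list.add("trade-42", lambda confs: print(f"trade-42 confirmed by {len(confs)} nodes"))
waiting_list.handle_response("trade-42", "clearing", {"status": "ok"})
waiting_list.handle_response("trade-42", "settlement", {"status": "ok"})
```

A production implementation would also expire entries that never reach quorum, so clients receive a timely failure instead of waiting indefinitely.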
This approach ensures that client requests in capital markets systems are efficiently tracked and processed, even in the face of asynchronous communication. By leveraging a waiting list and majority quorum, the system ensures that critical operations like trade execution and settlement are handled accurately and consistently.
The following diagram provides a complete overview of how the Clock-Bound Wait pattern ensures consistent transaction processing across nodes. Node A processes a transaction and waits for 20 milliseconds to account for clock skew before committing the transaction. Node B, which receives a read request, waits for its clock to catch up before reading and returning the updated value.
In distributed banking systems, ensuring data consistency across multiple nodes is critical, especially when transactions are processed across geographically dispersed regions. One major challenge is that system clocks on different nodes may not always be synchronized, leading to inconsistent data when updates are propagated at different times. The Clock-Bound Wait pattern addresses these clock discrepancies and ensures that data is consistently ordered across all nodes.
The Problem: Time Discrepancies and Data Inconsistency
In a distributed banking system, when customer transactions such as deposits and withdrawals are processed, the local node handling the transaction uses its system clock to timestamp the operation. If the system clocks of different nodes are not perfectly aligned, it may result in inconsistencies when reading or writing data. For instance, Node A may process a transaction at 10:00 AM, but Node B, whose clock is lagging, could still show the old account balance because it hasn’t yet caught up to Node A’s time. This can lead to confusion and inaccuracies in customer-facing data.
As seen in the diagram below, the clocks of various nodes in a distributed system may not be perfectly synchronized. Even a small time difference, known as clock skew, can cause nodes to process transactions at different times, resulting in data inconsistency.
Clock-Bound Wait: Ensuring Correct Ordering of Transactions
To solve this problem, the Clock-Bound Wait pattern introduces a brief waiting period when processing transactions to ensure that all nodes have advanced past the timestamp of the transaction being written or read. Here’s how it works:
Maximum Clock Offset: The system first calculates the maximum time difference, or offset, between the fastest and slowest clocks across all nodes. For example, if the maximum offset is 20 milliseconds, this value is used as the buffer for synchronizing data.
Waiting to Guarantee Synchronization: When Node A processes a transaction, it waits for a period (based on the maximum clock offset) to ensure that all other nodes have moved beyond the transaction’s timestamp before committing the change. For example, if Node A processes a transaction at 10:00 AM, it will wait for 20 milliseconds to ensure that all nodes’ clocks are past 10:00 AM before confirming the transaction.
The diagram illustrates how a transaction is processed at Node A, with a controlled wait for 20 milliseconds to allow other nodes (Node B and Node C) to synchronize their clocks before the transaction is committed across all nodes. This ensures that no node processes outdated or incorrectly ordered transactions.
Consistent Reads and Writes: The same waiting mechanism is applied when reading data. If a node receives a read request but its clock is behind the latest transaction timestamp, it waits for its clock to synchronize before returning the correct, updated data.
The diagram illustrates how a customer request for an account balance is handled. Node B, with a clock lagging behind Node A, must wait for its clock to synchronize before returning the updated balance, ensuring that the customer sees the most accurate data.
Eventual Consistency Without Significant Delays: Although the system introduces a brief wait period to account for clock discrepancies, the Clock-Bound Wait pattern allows the system to remain eventually consistent without significant delays in transaction processing. This ensures that customers experience up-to-date information without noticeable latency.
The diagram below demonstrates how regional nodes in different locations (North America, Europe, and Asia) wait for clock synchronization to ensure that transaction updates are consistent across the entire system. Once the clocks are in sync, final consistency is achieved across all regions.
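As a rough sketch of the waiting rule described above, the following Python fragment delays a write until every node's clock is guaranteed to have passed the write's timestamp, and delays a read whose local clock still lags the latest write. The 20 ms offset and the in-memory store are assumptions for illustration:

```python
import time

MAX_CLOCK_OFFSET_MS = 20  # assumed maximum skew between any two nodes

def commit_write(store: dict, key: str, value, write_ts_ms: float) -> None:
    """Clock-bound wait on write: hold the commit until real time has
    passed the write timestamp plus the maximum offset, so that no
    node's clock can still be behind it."""
    remaining_s = (write_ts_ms + MAX_CLOCK_OFFSET_MS - time.time() * 1000) / 1000
    if remaining_s > 0:
        time.sleep(remaining_s)
    store[key] = (value, write_ts_ms)

def read(store: dict, key: str, local_clock_ms: float):
    """Clock-bound wait on read: if this node's clock lags the latest
    write, wait for it to catch up before returning the value."""
    value, write_ts_ms = store[key]
    lag_s = (write_ts_ms + MAX_CLOCK_OFFSET_MS - local_clock_ms) / 1000
    if lag_s > 0:
        time.sleep(lag_s)
    return value
```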
Application in Banking Systems
In a distributed banking system, the Clock-Bound Wait pattern ensures that account balances and transaction histories remain consistent across all nodes. When a customer performs a transaction, the system guarantees that the updated balance is visible across all nodes after a brief wait period, regardless of clock discrepancies. This prevents situations where one node shows an outdated balance while another node shows the updated balance.
The Clock-Bound Wait pattern is a practical solution for managing clock discrepancies in distributed banking systems. By introducing a brief wait to synchronize clocks across nodes, the pattern ensures that transactions are consistently ordered and visible, maintaining data accuracy without significant performance overhead. This approach is particularly valuable in high-stakes industries like banking, where consistency and reliability are paramount.
Problem Statement: Scaling Interactive Bank Reconciliation
Bank reconciliation is a crucial but often complex and resource-intensive process. Suppose we have 500,000 reconciliation files stored in S3 for a two-way reconciliation process. These files are split into two categories:
250,000 bank statement files
250,000 transaction files from the company’s internal records
The objective is to reconcile these files by matching the transactions from both sources and loading the results into a database, followed by triggering reporting jobs.
Legacy System
Challenges with Current Approaches:
Sequential Processing Limitations:
A simple approach might involve iterating through the 500,000 files in a loop, but this would take an impractical amount of time. Even a far smaller dataset of 5,000 files would already expose the inefficiency of sequential processing at this scale.
Data Scalability:
As the number of files increases (say, to 1 million files), this approach becomes completely unfeasible without significant performance degradation. Traditional methods are not designed to scale effectively.
Fault Tolerance:
In a large-scale operation like this, system failures can happen. If one node fails during reconciliation, the entire process could stop, requiring complex error-handling logic to ensure continuity.
Cost & Resource Management:
Balancing the cost of infrastructure with performance is another challenge. Over-provisioning resources to handle peak load times could be expensive while under-provisioning could lead to delays and failed jobs.
Complexity in Distributed Processing:
Setting up distributed processing frameworks, such as Hadoop or Spark, introduces a significant learning curve for developers who aren’t experienced with big data frameworks. Additionally, provisioning and maintaining clusters of machines adds further complexity.
AWS Step Functions, a serverless workflow orchestration service, solves these challenges efficiently by enabling scalable, distributed processing with minimal infrastructure management. With the Step Functions Distributed Map feature, large datasets like the 500,000 reconciliation files can be processed in parallel, simplifying the workflow while ensuring scalability, fault tolerance, and cost-effectiveness.
Key Benefits of the Solution:
Parallel Processing for Faster Reconciliation:
Distributed Map breaks down the 500,000 reconciliation tasks across multiple compute nodes, allowing files to be processed concurrently. This greatly reduces the time needed to reconcile large volumes of data.
Scalability:
The workflow scales effortlessly as the number of reconciliation files increases. Step Functions Distributed Map handles the coordination, ensuring that you can move from 500,000 to 1 million files without requiring a major redesign.
Fault Tolerance & Recovery:
If a node fails during the reconciliation process, the coordinator will rerun the failed tasks on another compute node, preventing the entire process from stalling. This ensures greater resilience in high-scale operations.
Cost Optimization:
As a serverless service, Step Functions automatically scales based on usage, meaning you’re only charged for what you use. There’s no need to over-provision resources, and scaling happens without manual intervention.
Developer-Friendly:
Developers don’t need to learn complex big data frameworks like Spark or Hadoop. Step Functions allows for orchestration of workflows using simple tasks and services like AWS Lambda, making it accessible to a broader range of teams.
Workflow Implementation:
The proposed Step Functions Distributed Map workflow for bank reconciliation can be broken down into the following steps:
Stage the Data:
Amazon Athena is used to stage the reconciliation data, preparing it for further processing.
Gather Third-Party Data:
A Lambda function fetches any necessary third-party data, such as exchange rates or fraud detection information, to enrich the reconciliation process.
Run Distributed Map:
The Distributed Map state initiates the reconciliation between each pair of files (one from the bank statements and one from the internal records). Each pair is processed in parallel, maximizing throughput and minimizing reconciliation time (a sketch of such a per-pair handler appears after this list).
Aggregation:
Once all pairs are reconciled, the results are aggregated into a summary report. This report is stored in a database, making the data ready for reporting and further analysis.
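As an illustration, a hypothetical per-item Lambda handler that the Distributed Map state could invoke for each file pair might look like the sketch below. The event shape, field names, and the (reference, amount) matching key are assumptions:

```python
import csv
import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    """Hypothetical per-item handler for the Distributed Map state.
    Each item carries one bank-statement file and one internal-records file."""
    bucket = event["bucket"]
    bank_rows = load_rows(bucket, event["bank_statement_key"])
    internal_rows = load_rows(bucket, event["internal_records_key"])

    # Index internal transactions by (reference, amount) for matching
    index = {(r["reference"], r["amount"]): r for r in internal_rows}
    matched, open_items = [], []
    for row in bank_rows:
        key = (row["reference"], row["amount"])
        (matched if key in index else open_items).append(row)

    # Matched counts feed the aggregation step; open items go to review
    return {"matched": len(matched), "open_items": open_items}

def load_rows(bucket: str, key: str) -> list:
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
    return list(csv.DictReader(body.splitlines()))
```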
AWS Step Functions Distributed Map offers a scalable, fault-tolerant, and cost-effective solution to processing large datasets for bank reconciliation. Its serverless nature removes the complexity of managing infrastructure and enables developers to focus on the core business logic. By integrating services like AWS Lambda and Athena, businesses can achieve better performance and efficiency in high-scale reconciliation processes and many other use cases.
This article outlines the system design for automating the banking reconciliation process by migrating existing manual tasks to AWS. The solution leverages various AWS services to create a scalable, secure, and efficient system. The goal is to reduce manual effort, minimize errors, and enhance operational efficiency within the financial reconciliation workflow.
Key Objectives:
Develop a user-friendly custom interface for managing reconciliation tasks.
Utilize AWS services like Lambda, Glue, S3, and EMR for data processing automation.
Implement robust security and monitoring mechanisms to ensure system reliability.
Provide post-deployment support and monitoring for continuous improvement.
Architecture Overview
The architecture comprises several AWS services, each fulfilling specific roles within the system, and integrates with corporate on-premises resources via Direct Connect.
Direct Connect: Securely connects the corporate data center to the AWS VPC, enabling fast and secure data transfer between on-premises systems and AWS services.
Data Ingestion
Amazon S3 (Incoming Files Bucket): Acts as the primary data repository where incoming files are stored. The bucket triggers the Lambda function when new data is uploaded.
Bucket Policy: Ensures that only authorized services and users can access and interact with the data stored in S3.
Event-Driven Processing
AWS Lambda: Placed in a private subnet, this function is triggered by S3 events (e.g., file uploads) and initiates data processing tasks.
IAM Permissions: Lambda has permissions to access the S3 bucket and trigger the Glue ETL job.
Data Transformation
AWS Glue ETL Job: Handles the extraction, transformation, and loading (ETL) of data from the S3 bucket, preparing it for further processing.
NAT Gateway: Located in a public subnet, the NAT Gateway allows the Lambda function and Glue ETL job to access the internet for downloading dependencies without exposing them to inbound internet traffic.
Data Processing and Storage
Amazon EMR: Performs complex transformations and applies business rules necessary for reconciliation processes, processing data securely within the private subnet.
Amazon Redshift: Serves as the central data warehouse where processed data is stored, facilitating further analysis and reporting.
RDS Proxy: Pools and manages secure database connections to the RDS instance backing the reconciliation UI (RDS Proxy supports RDS and Aurora engines, not Redshift).
Business Intelligence
Amazon QuickSight: A visualization tool that provides dashboards and reports based on the data stored in Redshift, helping users to make informed decisions.
User Interface
Reconciliation UI: Hosted on AWS and integrated with RDS, this custom UI allows finance teams to manage reconciliation tasks efficiently.
Okta SSO: Manages secure user authentication via Azure AD, ensuring that only authorized users can access the reconciliation UI.
Orchestration and Workflow Management
AWS Step Functions: Orchestrates the entire workflow, ensuring that each step in the reconciliation process is executed in sequence and managed effectively.
AWS Secrets Manager: Securely stores and manages credentials needed by various AWS services.
Monitoring and Logging
Scalyr: Provides backend log collection and analysis, enabling visibility into system operations.
New Relic: Monitors application performance and tracks key metrics to alert on any issues or anomalies.
Notifications
AWS SNS: Sends notifications to users about the status of reconciliation tasks, including completions, failures, or other important events.
Security Considerations
Least Privilege Principle: All IAM roles and policies are configured to ensure that each service has only the permissions necessary to perform its functions, reducing the risk of unauthorized access.
Encryption: Data is encrypted at rest in S3, Redshift, and in transit, meeting compliance and security standards.
Network Security: The use of private subnets, security groups, and network ACLs ensures that resources are securely isolated within the VPC, protecting them from unauthorized access.
Code Implementation
Below are the key pieces of code required to implement the Lambda function and the CloudFormation template for the AWS infrastructure.
Lambda Python Code to Trigger Glue
Here’s a Python code snippet that can be deployed as part of the Lambda function to trigger the Glue ETL job upon receiving a new file in the S3 bucket:
```python
import json
import boto3
import logging

# Set up logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)

# Initialize the Glue and S3 clients
glue_client = boto3.client('glue')
s3_client = boto3.client('s3')

def lambda_handler(event, context):
    """
    Lambda function to trigger an AWS Glue job when a new file is uploaded to S3.
    """
    try:
        # Extract the bucket name and object key from the event
        bucket_name = event['Records'][0]['s3']['bucket']['name']
        object_key = event['Records'][0]['s3']['object']['key']

        # Log the file details
        logger.info(f"File uploaded to S3 bucket {bucket_name}: {object_key}")

        # Define the Glue job name
        glue_job_name = "your_glue_job_name"

        # Start the Glue job with the required arguments
        response = glue_client.start_job_run(
            JobName=glue_job_name,
            Arguments={
                '--s3_input_file': f"s3://{bucket_name}/{object_key}",
                '--other_param': 'value'  # Add any other necessary Glue job parameters here
            }
        )

        # Log the response from Glue
        logger.info(f"Started Glue job: {response['JobRunId']}")

    except Exception as e:
        logger.error(f"Error triggering Glue job: {str(e)}")
        raise e
```
The Lambda function code is structured as follows:
Import Libraries: Imports necessary libraries like json, boto3, and logging to handle JSON data, interact with AWS services, and manage logging.
Set Up Logging: Configures logging to capture INFO level messages, which is crucial for monitoring and debugging the Lambda function.
Initialize AWS Clients: Initializes Glue and S3 clients using boto3 to interact with these AWS services.
Define Lambda Handler Function: The main function, lambda_handler(event, context), serves as the entry point and handles events triggered by S3.
Extract Event Data: Retrieves the S3 bucket name (bucket_name) and object key (object_key) from the event data passed to the function.
Log File Details: Logs the bucket name and object key of the uploaded file to help track what is being processed.
Trigger Glue Job: Initiates a Glue ETL job using start_job_run with the S3 object passed as input, kicking off the data transformation process.
Log Job Run ID: Logs the Glue job’s JobRunId for tracking purposes, helping to monitor the job’s progress.
Error Handling: Catches and logs any exceptions that occur during execution to ensure issues are identified and resolved quickly.
IAM Role Configuration: Ensures the Lambda execution role has the necessary permissions (glue:StartJobRun, s3:GetObject, etc.) to interact with AWS resources securely.
CloudFormation Template
Below is the CloudFormation template that defines the infrastructure required for this architecture:
The Saga pattern is an architectural pattern utilized for managing distributed transactions in microservices architectures. It ensures data consistency across multiple services without relying on distributed transactions, which can be complex and inefficient in a microservices environment.
Key Concepts of the Saga Pattern
In the Saga pattern, a business process is broken down into a series of local transactions. Each local transaction updates the database and publishes an event or message to trigger the next transaction in the sequence. This approach helps maintain data consistency across services by ensuring that each step is completed before moving to the next one.
Types of Saga Patterns
There are several variations of the Saga pattern, each suited to different scenarios:
Choreography-based Saga: Each service listens for events and decides whether to proceed with the next step based on the events it receives. This decentralized approach is useful for loosely coupled services.
Orchestration-based Saga: A central coordinator, known as the orchestrator, manages the sequence of actions. This approach provides a higher level of control and is beneficial when precise coordination is required.
State-based Saga: Uses a shared state or state machine to track the progress of a transaction. Microservices update this state as they execute their actions, guiding subsequent steps.
Reverse Choreography Saga: An extension of the Choreography-based Saga where services explicitly communicate about how to compensate for failed actions.
Event-based Saga: Microservices react to events generated by changes in the system, performing necessary actions or compensations asynchronously.
Challenges Addressed by the Saga Pattern
The Saga pattern solves the problem of maintaining data consistency across multiple microservices in distributed transactions. It addresses several key challenges that arise in microservices architectures:
Distributed Transactions: In a microservices environment, a single business transaction often spans multiple services, each with its own database. Traditional ACID transactions don’t work well in this distributed context.
Data Consistency: Ensuring data consistency across different services and their databases is challenging when you can’t use a single, atomic transaction.
Scalability and Performance: Two-phase commit (2PC) protocols, which are often used for distributed transactions, can lead to performance issues and reduced scalability in microservices architectures.
Solutions Provided by the Saga Pattern
The Saga pattern solves these problems by:
Breaking down distributed transactions into a sequence of local transactions, each handled by a single service.
Using compensating transactions to undo changes if a step in the sequence fails, ensuring eventual consistency.
Flexibility in transaction management, allowing services to be added, modified, or removed without significantly impacting the overall transactional flow.
Better scalability by allowing each service to manage its own local transaction independently.
Improving fault tolerance by providing mechanisms to handle and recover from failures in the transaction sequence.
Visibility into the transaction process, which aids in debugging, auditing, and compliance.
Implementation Approaches
Choreography-Based Sagas
Decentralized Control: Each service involved in the saga listens for events and reacts to them independently, without a central controller.
Event-Driven Communication: Services communicate by publishing and subscribing to events.
Autonomy and Flexibility: Services can be added, removed, or modified without significantly impacting the overall system.
Scalability: Choreography can handle complex and frequent interactions more flexibly, making it suitable for highly scalable systems.
Orchestration-Based Sagas
Centralized Control: A central orchestrator manages the sequence of transactions, directing each service on what to do and when.
Command-Driven Communication: The orchestrator sends commands to services to perform specific actions.
Visibility and Control: The orchestrator has a global view of the saga, making it easier to manage and troubleshoot.
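To ground the orchestration approach, here is a minimal Python sketch of an orchestrator that runs local transactions in sequence and triggers compensating transactions in reverse order on failure. The step structure and the payment/inventory example are illustrative assumptions, not a specific framework's API:

```python
class SagaStep:
    def __init__(self, action, compensation):
        self.action = action              # the local transaction
        self.compensation = compensation  # undoes the local transaction

def run_saga(steps, context):
    """Run each local transaction in order; if one fails, run the
    compensations for the completed steps in reverse order."""
    completed = []
    for step in steps:
        try:
            step.action(context)
            completed.append(step)
        except Exception:
            for done in reversed(completed):
                done.compensation(context)
            return False  # saga aborted; earlier changes compensated
    return True  # all local transactions committed

# Example: a two-step saga with hypothetical payment and inventory actions
steps = [
    SagaStep(lambda ctx: ctx.update(payment="charged"),
             lambda ctx: ctx.update(payment="refunded")),
    SagaStep(lambda ctx: ctx.update(stock="reserved"),
             lambda ctx: ctx.update(stock="released")),
]
context = {}
print("committed" if run_saga(steps, context) else "compensated", context)
```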
Choosing Between Choreography and Orchestration
When to Use Choreography
When you want to avoid creating a single point of failure.
When services need to be highly autonomous and independent.
When adding or removing services without disrupting the overall flow is a priority.
When to Use Orchestration
When you need to guarantee a specific order of execution.
When centralized control and visibility are crucial for managing complex workflows.
When you need to manage the lifecycle of microservices execution centrally.
Hybrid Approach
In some cases, a combination of both approaches can be beneficial. Choreography can be used for parts of the saga that require high flexibility and autonomy, while orchestration can manage parts that need strict control and coordination.
Challenges and Considerations
Complexity: Implementing the Saga pattern can be more complex than using traditional transactions.
Lack of Isolation: Intermediate states are visible, which can lead to consistency issues.
Error Handling: Designing and implementing compensating transactions can be tricky.
Testing: Thorough testing of all possible scenarios is crucial but can be challenging.
The Saga pattern is powerful for managing distributed transactions in microservices architectures, offering a balance between consistency, scalability, and resilience. By carefully selecting the appropriate implementation approach, organizations can effectively address the challenges of distributed transactions and maintain data consistency across their services.
This article is the outcome of a discussion with a fellow solution architect about the different approaches, or schools of thought, a solution architect might follow. If you disagree with anything here, we ask that you respect our point of view; we are open to healthy discussion on the topic.
“Good architecture is like a great novel: it gets better with every reading.” — Robert C. Martin
In the field of solution architecture, there are several approaches one might take. Among them are the Problem-First Approach, Design-First Approach, Domain-Driven Design (DDD), and Agile Architecture. Each has its own focus and methodology, and the choice of approach depends on the context and specific needs of the project.
“The goal of software architecture is to minimize the human resources required to build and maintain the required system.” — Robert C. Martin
Based on the various approaches discussed, we propose a common and effective order for a solution architect to follow:
1. Problem Statement
Define and Understand the Problem: Begin by clearly defining the problem that needs to be solved. This involves gathering requirements, understanding business needs, objectives, constraints, and identifying any specific challenges. This foundational step ensures that all subsequent efforts are aligned with solving the correct issue.
“In software, the most beautiful code, the most beautiful functions, and the most beautiful programs are sometimes not there at all.” — Jon Bentley
2. High-Level Design
Develop a Conceptual Framework: Create a high-level design that outlines the overall structure of the solution. Identify major components, their interactions, data flow, and the overall system architecture. This step provides a bird’s-eye view of the solution, ensuring that all stakeholders have a common understanding of the proposed system.
“The most important single aspect of software development is to be clear about what you are trying to build.” — Bjarne Stroustrup
3. Architecture Patterns
Select Suitable Patterns: Identify and choose appropriate architecture patterns that fit the high-level design and problem context. Patterns such as microservices, layered architecture, and event-driven architecture help ensure the solution is robust, scalable, and maintainable. Selecting the right pattern is crucial for addressing the specific needs and constraints of the project.
“A pattern is a solution to a problem in a context.” — Christopher Alexander
4. Technology Stacks
Choose Technologies: Select the technology stacks that will be used to implement the solution. This includes programming languages, frameworks, databases, cloud services, and other tools that align with the architecture patterns and high-level design. Consider factors like team expertise, performance, scalability, and maintainability. The choice of technology stack has a significant impact on the implementation and long-term success of the project.
“Any sufficiently advanced technology is indistinguishable from magic.” — Arthur C. Clarke
5. Low-Level Design
Detail Each Component: Create detailed, low-level designs for each component identified in the high-level design. Specify internal structures, interfaces, data models, algorithms, and detailed workflows. This step ensures that each component is well-defined and can be effectively implemented by development teams. Detailed design documents help in minimizing ambiguities and ensuring a smooth development process.
“Good design adds value faster than it adds cost.” — Thomas C. Gale
Summary of Order: Problem Statement → High-Level Design → Architecture Patterns → Technology Stacks → Low-Level Design.
Practical Considerations:
Iterative Feedback and Validation: Incorporate iterative feedback and validation throughout the process. Regularly review designs with stakeholders and development teams to ensure alignment with business goals and to address any emerging issues. This iterative process helps in refining the solution and addressing any unforeseen challenges.
“You can’t improve what you don’t measure.” — Peter Drucker
Documentation: Maintain comprehensive documentation at each stage to ensure clarity and facilitate communication among stakeholders. Good documentation practices help in maintaining a record of decisions and the rationale behind them, which is useful for future reference and troubleshooting.
Flexibility: Be prepared to adapt and refine designs as new insights and requirements emerge. This approach allows for continuous improvement and alignment with evolving business needs. Flexibility is key to responding effectively to changing business landscapes and technological advancements.
“The measure of intelligence is the ability to change.” — Albert Einstein
Guidelines for Selecting an Approach
Here are some general guidelines for selecting an approach:
Problem-First Approach: This approach is suitable when the problem domain is well-understood, and the focus is on finding the best solution to address the problem. It works well for projects with clear requirements and constraints.
Design-First Approach: This approach is beneficial when the system’s architecture and design are critical, and upfront planning is necessary to ensure the system meets its quality attributes and non-functional requirements.
Domain-Driven Design (DDD): DDD is a good fit for complex domains with intricate business logic and evolving requirements. It promotes a deep understanding of the domain and helps in creating a maintainable and extensible system.
Agile Architecture: An agile approach is suitable when requirements are likely to change frequently, and the team needs to adapt quickly. It works well for projects with a high degree of uncertainty or rapidly changing business needs.
Ultimately, the choice of approach should be based on a careful evaluation of the project’s specific context, requirements, and constraints, as well as the team’s expertise and the organization’s culture and processes. It’s also common to combine elements from different approaches or tailor them to the project’s needs.
“The best way to predict the future is to invent it.” — Alan Kay
Real-Life Use Case: Netflix Microservices Architecture
A notable real-life example of following a structured approach in solution architecture is Netflix’s transition to a microservices architecture. Here’s how Netflix applied a similar order in their architectural approach:
1. Problem Statement
Netflix faced significant challenges with their existing monolithic architecture, including scalability issues, difficulty in deploying new features, and handling increasing loads as their user base grew globally. The problem was clearly defined: the need for a scalable, resilient, and rapidly deployable architecture to support their expanding services.
“If you define the problem correctly, you almost have the solution.” — Steve Jobs
2. High-Level Design
Netflix designed a high-level architecture that focused on breaking down their monolithic application into smaller, independent services. This conceptual framework provided a clear vision of how different components would interact and be managed. They aimed to achieve a highly decoupled system where services could be developed and deployed independently.
3. Architecture Patterns
Netflix chose a combination of several architectural patterns to meet their specific needs:
Microservices Architecture: This pattern allowed Netflix to create independent services that could be developed, deployed, and scaled individually. Each microservice handled a specific business capability and communicated with others through well-defined APIs. This pattern provided the robustness and scalability needed to handle millions of global users.
Event-Driven Architecture: Netflix implemented an event-driven architecture to handle asynchronous communication between services. This pattern was essential for maintaining responsiveness and reliability in a highly distributed system. Services communicate via events, allowing the system to remain loosely coupled and scalable.
Circuit Breaker Pattern: Using tools like Hystrix, Netflix adopted the circuit breaker pattern to prevent cascading failures and to manage service failures gracefully. This pattern improved the resilience and fault tolerance of their architecture.
Service Discovery Pattern: Netflix utilized Eureka for service discovery. This pattern ensured that services could dynamically locate and communicate with each other, facilitating load balancing and failover strategies.
API Gateway Pattern: Zuul was employed as an API gateway, providing a single entry point for all client requests. This pattern helped manage and route requests to the appropriate microservices, improving security and performance.
4. Technology Stacks
Netflix selected a technology stack that included:
Java: For developing the core services due to its maturity, scalability, and extensive ecosystem.
Cassandra: For data storage, providing high availability and scalability across multiple data centers.
AWS: For cloud infrastructure, offering scalability, reliability, and a wide range of managed services.
Netflix also implemented additional tools and technologies to support their architecture patterns:
Hystrix: For implementing the circuit breaker pattern.
Eureka: For service discovery and registration.
Zuul: For API gateway and request routing.
Kafka: For event-driven messaging and real-time data processing.
Spinnaker: For continuous delivery and deployment automation.
5. Low-Level Design
Detailed designs for each microservice were created, specifying how they would interact with each other, handle data, and manage failures. This included defining:
APIs: Well-defined interfaces for communication between services.
Data Models: Schemas and structures for data storage and exchange.
Communication Protocols: RESTful APIs, gRPC, and event-based messaging.
Internal Structures: Detailed workflows, algorithms, and internal component interactions.
Each microservice was developed with clear boundaries and responsibilities, ensuring a well-structured implementation. Teams were organized around microservices, allowing for autonomous development and deployment cycles.
“The details are not the details. They make the design.” — Charles Eames
Practical Considerations
Netflix continuously incorporated iterative feedback and validation through extensive testing and monitoring. They maintained comprehensive documentation for their microservices, facilitating communication and understanding among teams. Flexibility was a core principle, allowing Netflix to adapt and refine their services based on real-time performance data and user feedback.
Iterative Feedback and Validation: Netflix used canary releases, A/B testing, and real-time monitoring to gather feedback and validate changes incrementally. This allowed them to make informed decisions and continuously improve their services.
Documentation: Detailed documentation was maintained for each microservice, including API specifications, architectural decisions, and operational guidelines. This documentation was essential for onboarding new team members and ensuring consistency across the organization.
Flexibility: The architecture was designed to be adaptable, allowing Netflix to quickly respond to changing requirements and scale services as needed. Continuous integration and continuous deployment (CI/CD) practices enabled rapid iteration and deployment.
“Flexibility requires an open mind and a welcoming of new alternatives.” — Deborah Day
By adopting a combination of architecture patterns and leveraging a robust technology stack, Netflix successfully transformed their monolithic application into a scalable, resilient, and rapidly deployable microservices architecture. This transition not only addressed their immediate challenges but also positioned them for future growth and innovation.
The approach a solution architect takes can significantly impact the success of a project. By following a structured process that starts with understanding the problem, moving through high-level and low-level design, and incorporating feedback and flexibility, a solution architect can create robust, scalable, and effective solutions. This methodology not only addresses immediate business needs but also lays a strong foundation for future growth and adaptability. The case of Netflix demonstrates how applying these principles can lead to successful, scalable, and resilient architectures that support business objectives and user demands.
This article explains how a Bank Reconciliation System is structured on AWS, with the aim of processing and reconciling banking transactions. The system automates the matching of transactions from batch feeds and provides a user interface for manually reconciling any open items.
Architecture Overview
The BRS (Bank Reconciliation System) is engineered to support high-volume transaction processing with an emphasis on automation, accuracy, and user engagement for manual interventions. The system incorporates AWS cloud services to ensure scalability, availability, and security.
Technical Flow
Batch Feed Ingestion: Transaction files, referred to as “left” and “right” feeds, are exported from an on-premises data center into the AWS environment.
Storage and Processing: Files are stored in an S3 bucket, triggering AWS Lambda functions.
Automated Reconciliation: Lambda functions process the batch feeds to perform automated matching of transactions. Matched transactions are termed “auto-match.”
Database Storage: Both the auto-matched transactions and the unmatched transactions, known as “open items,” are stored in an Amazon Aurora database.
Application Layer: A backend application, developed with Spring Boot, interacts with the database to retrieve and manage transaction data.
User Interface: An Angular front-end application presents the open items to application users (bank employees) for manual reconciliation.
System Components
AWS S3: Initial repository for batch feeds. Its event-driven capabilities trigger processing via Lambda.
AWS Lambda: The serverless compute layer that processes batch feeds and performs auto-reconciliation.
Amazon Aurora: A MySQL- and PostgreSQL-compatible relational database used to store both auto-matched and open transactions.
Spring Boot: Provides the backend services that facilitate the retrieval and management of transaction data for the front-end application.
Angular: The front-end framework used to build the user interface for the manual reconciliation process.
System Interaction
Ingestion: Batch feeds from the on-premises data center are uploaded to AWS S3.
Triggering Lambda: S3 events upon file upload automatically invoke Lambda functions dedicated to processing these feeds.
Processing: Lambda functions parse the batch feeds, automatically reconcile transactions where possible, and identify open items for manual reconciliation.
Storing Results: Lambda functions store the outcomes in the Aurora database, segregating auto-matched and open items.
User Engagement: The Spring Boot application provides an API for the Angular front-end, through which bank employees access and work on open items.
Manual Reconciliation: Users perform manual reconciliations via the Angular application, which updates the status of transactions within the Aurora database accordingly.
Security and Compliance
Data Encryption: All data in transit and at rest are encrypted using AWS security services.
Identity Management: Amazon Cognito ensures secure user authentication for application access.
Web Application Firewall: AWS WAF protects against common web threats and vulnerabilities.
Monitoring and Reliability
CloudWatch: Monitors the system, logging all events, and setting up alerts for anomalies.
High Availability: The system spans multiple Availability Zones for resilience and employs Elastic Load Balancing for traffic distribution.
Scalability
Elastic Beanstalk & EKS: Both services can scale the compute resources automatically in response to the load, ensuring that the BRS can handle peak volumes efficiently.
Note: When you deploy an application using Elastic Beanstalk, it automatically sets up an Elastic Load Balancer in front of the EC2 instances that are running your application. This is to distribute incoming traffic across those instances to balance the load and provide fault tolerance.
Cost Optimization
S3 Intelligent-Tiering: Manages storage costs by automatically moving less frequently accessed data to lower-cost tiers.
DevOps Practices
CodeCommit & ECR: Source code management and container image repository are handled via AWS CodeCommit and ECR, respectively, streamlining the CI/CD pipeline.
The BRS leverages AWS services to create a seamless, automated reconciliation process, complemented by an intuitive user interface for manual intervention, ensuring a robust solution for the bank’s reconciliation needs.
System design interviews can be daunting due to their complexity and the vast knowledge required to excel. Whether you’re a recent graduate or a seasoned engineer, preparing for these interviews necessitates a well-thought-out strategy and access to the right resources. In this article, I’ll guide you to navigate the system design landscape and equip you to succeed in your upcoming interviews.
Start with the Basics
“Web Scalability for Startup Engineers” by Artur Ejsmont — This book is recommended as a starting point for beginners in system design.
“Designing Data-Intensive Applications” by Martin Kleppmann is described as a more in-depth resource for those with a basic understanding of system design.
It’s essential to establish a strong foundation before delving too deep into a subject. For beginners, “Web Scalability for Startup Engineers” is an excellent resource. It covers the basics and prepares you for more advanced concepts. After mastering the fundamentals, “Designing Data-Intensive Applications” by Martin Kleppmann will guide you further into data systems.
Microservices and Domain-Driven Design
“Building Microservices” by Sam Newman — Focuses on microservices architecture and its implications in system design.
Once you are familiar with the fundamentals, the next step is to explore the intricacies of the microservices architectural style through “Building Microservices.” To gain a deeper understanding of practical patterns and design principles, “Microservices Patterns and Best Practices” is an excellent resource. Lastly, for those who wish to understand the philosophy behind system architecture, “Domain-Driven Design” is a valuable read.
API Design and gRPC
“RESTful Web APIs” by Leonard Richardson, Mike Amundsen, and Sam Ruby provides a comprehensive guide to developing web-based APIs that adhere to the REST architectural style.
In the present world, APIs serve as the connective tissue of the internet. If you intend to design effective APIs, a good starting point is “RESTful Web APIs” by Leonard Richardson and his colleagues. If you are exploring Remote Procedure Call (RPC) frameworks, particularly gRPC, then “gRPC: Up and Running” is a comprehensive guide.
Preparing for the Interview
“System Design Interview — An Insider’s Guide” by Alex Xu is an essential book for those preparing for challenging system design interviews.
It offers a comprehensive look at the strategies and thought processes required to navigate these complex discussions. Although it is one of many resources candidates will need, the book is tailored to equip them with the means to dissect and approach real interview questions. The book blends technical knowledge with the all-important communicative skills, preparing candidates to think on their feet and articulate clear and effective system design solutions. Xu’s guide demystifies the interview experience, providing a rich set of examples and insights to help candidates prepare for the interview process.
Domain-Specific Knowledge
Enhance your knowledge in your domain with books such as “Kafka: The Definitive Guide” for Distributed Messaging and “Cassandra: The Definitive Guide” for understanding wide-column stores. “Designing Event-Driven Systems” is crucial for grasping event sourcing and services using Kafka.
General Product Design
Pay attention to product design in system design. Books like “The Design of Everyday Things” and “Hooked: How to Build Habit-Forming Products” teach user-centric design principles, which are increasingly crucial in system design.
Online Resources
The internet is a goldmine of information. You can watch tech conference talks, follow YouTube channels such as Gaurav Sen’s System Design Interview, and read engineering blogs from companies like Uber, Netflix, and LinkedIn.
System design is an iterative learning process that blends knowledge, curiosity, and experience. The resources provided here are a roadmap to guide you through this journey. With the help of these books and resources, along with practice and reflection, you will be well on your way to mastering system design interviews. Remember, it’s not just about understanding system design but also about thinking like a system designer.
As cloud computing continues to evolve, microservices architectures are becoming increasingly complex. To effectively manage this complexity, service meshes are being adopted. In this article, we will explain what a service mesh is, why it is necessary for modern cloud architectures, and how it addresses some of the most pressing challenges developers face today.
Understanding the Service Mesh
A service mesh is a configurable infrastructure layer built into an application that enables flexible, reliable, and secure communication between individual service instances. Within a cloud-native environment, especially one that embraces containerization, a service mesh is critical for handling service-to-service communication, allowing for enhanced control, management, and security.
Why a Service Mesh?
As applications grow and evolve into distributed systems composed of many microservices, they often encounter challenges in service discovery, load balancing, failure recovery, security, and observability. A service mesh addresses these challenges by providing:
Dynamic Traffic Management: Adjusting the flow of requests and responses to accommodate changes in the infrastructure.
Improved Resiliency: Adding robustness to the system with patterns like retries, timeouts, and circuit breakers.
Enhanced Observability: Offering tools for monitoring, logging, and tracing to understand system performance and behaviour.
Security Enhancements: Ensuring secure communication through encryption and authentication protocols.
By implementing a service mesh, these distributed and loosely coupled applications can be managed more effectively, ensuring operational efficiency and security at scale.
Foundational Elements: Service Discovery and Proxies
The service mesh relies on two essential components: Consul and Envoy. Consul is responsible for service discovery: it keeps track of services, their locations, and their health status, ensuring that the system can adapt to changes in the environment. Envoy, deployed alongside each service instance, handles network communication, acting as an abstraction layer for traffic management and message routing.
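For a concrete sense of how service discovery works, here is a hedged Python sketch that registers a service instance with a local Consul agent over its HTTP API and looks up healthy instances by name. The service name, port, and health endpoint are assumptions for the example:

```python
import requests

CONSUL_AGENT = "http://localhost:8500"  # local Consul agent, assumed address

def register_service(name: str, address: str, port: int) -> None:
    """Register a service instance with the local Consul agent so other
    services can discover it by name instead of by static IP."""
    payload = {
        "Name": name,
        "ID": f"{name}-{address}-{port}",
        "Address": address,
        "Port": port,
        "Check": {  # Consul marks the instance unhealthy if this check fails
            "HTTP": f"http://{address}:{port}/health",
            "Interval": "10s",
        },
    }
    resp = requests.put(f"{CONSUL_AGENT}/v1/agent/service/register", json=payload)
    resp.raise_for_status()

def discover_service(name: str) -> list:
    """Return (address, port) pairs for healthy instances of a service."""
    resp = requests.get(
        f"{CONSUL_AGENT}/v1/health/service/{name}", params={"passing": "true"}
    )
    resp.raise_for_status()
    return [(e["Service"]["Address"], e["Service"]["Port"]) for e in resp.json()]
```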
Architectural Overview
The architecture consists of a Public and Private VPC setup, which encloses different clusters. The ‘LEFT_CLUSTER’ in the VPC is dedicated to critical services like logging and monitoring, which provide insights into the system’s operation and manage transactions. On the other hand, the ‘RIGHT_CLUSTER’ in the VPC contains services for Audit and compliance, Dashboards, and Archived Data, ensuring a robust approach to data management and regulatory compliance.
The diagram shows a service mesh architecture for sensitive banking operations in AWS. It comprises two clusters: the Left Cluster (VPC) includes a Mesh Gateway, Bank Interface, Authentication and Authorization systems, and a Reconciliation Engine; the Right Cluster (VPC) manages Audit, provides a Dashboard, stores Archived Data, and handles Notifications. Consul and Envoy proxies manage communication efficiently. Monitored by dedicated tools, the mesh ensures operational integrity and security in a complex banking ecosystem.
Mesh Gateways and Envoy Proxies
Mesh Gateways are crucial for inter-cluster communication, simplifying connectivity and network configurations. Envoy Proxies are strategically placed within the service mesh, managing the flow of traffic and enhancing the system’s ability to scale dynamically.
Security and User Interaction
The user’s journey begins with the authentication and authorization measures in place to verify and secure user access.
The Role of Consul
Consul’s service discovery capabilities are essential in allowing services like the Bank Interface and the Reconciliation Engine to discover each other and interact seamlessly, bypassing the limitations of static IP addresses.
Operational Efficiency
The service mesh’s contribution to operational efficiency is particularly evident in its integration with the Reconciliation Engine, ensuring that financial data requiring reconciliation is processed efficiently and securely and directed to the relevant services.
The Case for Service Mesh Integration
The shift to cloud-native architecture underscores the need for service meshes. This blueprint enhances agility and security, affirming the service mesh as pivotal to modern cloud networking.
In this article, I talk about how to build a system like Twitter. I focus on the problems that come up when very famous people, like Elon Musk, tweet and many people see it at once. I’ll share the basic steps, common issues, and how to keep everything running smoothly. My goal is to give you a simple guide on how to make and run such a system.
System Requirements
Functional Requirements:
User Management: Includes registration, login, and profile management.
Tweeting: Enables users to broadcast short messages.
Retweeting: This lets users share others’ content.
Timeline: Showcases tweets from the user and those they follow.
Non-functional Requirements:
Scalability: Must accommodate millions of users.
Availability: High uptime is the goal, achieved through multi-regional deployments.
Latency: Prioritizes real-time data retrieval and instantaneous content updates.
Security: Ensures protection against unauthorized breaches and data attacks.
Architecture Overview
This diagram outlines a microservices-based social media platform design. The user’s request flows through a CDN, then a load balancer to distribute the load among web servers. Core services and data storage solutions like DynamoDB, Blob Storage, and Amazon RDS are defined. An intermediary cache ensures fast data retrieval, and the Amazon Elasticsearch Service provides advanced search capabilities. Asynchronous tasks are managed through SQS, and specialized services for trending topics, direct messaging, and DDoS mitigation are included for a holistic approach to user experience and security.
Scalability
Load Balancer: Directs traffic to multiple servers to balance the load.
Microservices: Functional divisions ensure scalability without interference.
Auto Scaling: Adjusts resources based on the current demand.
Data Replication: Databases like DynamoDB replicate data across different locations.
CDN: Content Delivery Networks ensure swift asset delivery, minimizing latency.
Security
Authentication: OAuth 2.0 for stringent user validation.
Authorization: Role-Based Access Control (RBAC) defines user permissions.
Encryption: SSL/TLS for data during transit; AWS KMS for data at rest.
DDoS Protection: AWS Shield protects against volumetric attacks.
Data Design (NoSQL, e.g., DynamoDB)
User Table
Tweets Table
Timeline Table
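As a hedged illustration of this data design, the boto3 snippet below creates a Tweets table partitioned by user and sorted by creation time, which supports efficient per-user timeline queries. The attribute names and billing mode are assumptions for the example:

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Hypothetical Tweets table: partition by user, sort by tweet timestamp
dynamodb.create_table(
    TableName="Tweets",
    AttributeDefinitions=[
        {"AttributeName": "user_id", "AttributeType": "S"},
        {"AttributeName": "created_at", "AttributeType": "N"},
    ],
    KeySchema=[
        {"AttributeName": "user_id", "KeyType": "HASH"},    # partition key
        {"AttributeName": "created_at", "KeyType": "RANGE"},  # sort key
    ],
    BillingMode="PAY_PER_REQUEST",  # scales with traffic, no capacity planning
)
```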
Multimedia Content Storage (Blob Storage)
In the multimedia age, platforms akin to Twitter necessitate a system adept at managing images, GIFs, and videos. Blob storage, tailored for unstructured data, is ideal for efficiently storing and retrieving multimedia content, ensuring scalable, secure, and prompt access.
Backup Databases
In the dynamic world of microblogging, maintaining data integrity is imperative. Backup databases offer redundant data copies, shielding against losses from hardware mishaps, software anomalies, or malicious intents. Strategically positioned backup databases bolster quick recovery, promoting high availability.
Queue Service
The real-time interaction essence of platforms like Twitter underscores the importance of the Queue Service. This service is indispensable when managing asynchronous tasks and coping with sudden traffic influxes, especially with high-profile tweets. This queuing system:
Handles requests in an orderly fashion, preventing server inundations.
Decouples system components, safeguarding against cascading failures.
Preserves system responsiveness during high-traffic episodes.
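A minimal sketch of such a queue in Python with Amazon SQS: the producer enqueues work during a traffic spike, and a worker drains it at a sustainable pace. The queue URL and message fields are placeholders:

```python
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/tweet-fanout"  # placeholder

def enqueue_fanout_task(tweet_id: str, author_id: str) -> None:
    """Producer: record the work item instead of doing fan-out inline."""
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({"tweet_id": tweet_id, "author_id": author_id}),
    )

def worker_loop() -> None:
    """Consumer: process tasks decoupled from the request path that produced them."""
    while True:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20
        )
        for msg in resp.get("Messages", []):
            task = json.loads(msg["Body"])
            # ... perform timeline updates for task["tweet_id"] ...
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```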
Workflow Design
Standard Workflow
Tweeting: User submits a tweet → Handled by the Tweet Microservice → Authentication & Authorization → Stored in the database → Updated on the user’s timeline and followers’ timelines.
Retweeting: User shares another’s tweet → Retweet Microservice handles the action → Authentication & Authorization → The retweet is stored and updated on timelines.
Timeline Management: A user’s timeline combines tweets, retweets, and tweets from users they follow. Caching mechanisms like Redis can enhance timeline retrieval speed for frequently accessed ones.
Enhanced Workflow Design
Tweeting by High-Profile Users (high retrieval rate):
Tweet Submission: Elon Musk (or any high-profile user) submits a tweet.
Tweet Microservice Handling: The tweet is directed to the Tweet Microservice via the Load Balancer. Authentication and Authorization checks are executed.
Database Update: Once approved, the tweet is stored in the Tweets Table.
Deferred Update for Followers: High-profile tweets can be disseminated efficiently, without overloading the system, using a publish/subscribe (Pub/Sub) mechanism (see the sketch after this list).
Caching: Popular tweets, due to their high retrieval rate, benefit from caching mechanisms and CDN deployments.
Notifications: A selective notification system prioritizes active or frequent interaction followers for immediate notifications.
Monitoring and Auto-scaling: Resources are adjusted based on real-time monitoring to handle activity surges post high-profile tweets.
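To illustrate the split between ordinary and high-profile delivery, here is a hedged Python sketch: small accounts get eager fan-out on write, while celebrity tweets are published to a topic and merged into timelines on read. The follower threshold, topic ARN, and helper function are assumptions:

```python
import json
import boto3

sns = boto3.client("sns")
CELEBRITY_FOLLOWER_THRESHOLD = 1_000_000  # assumed cutoff for deferred delivery

def deliver_tweet(tweet: dict, follower_count: int, follower_ids: list) -> None:
    """Fan-out on write for ordinary users; publish-and-pull for
    high-profile users whose follower lists are too large to push to."""
    if follower_count < CELEBRITY_FOLLOWER_THRESHOLD:
        for follower_id in follower_ids:
            append_to_timeline(follower_id, tweet)  # eager push
    else:
        # Timeline reads merge in celebrity tweets on demand; subscribers
        # (cache warmers, notification services) react to this event.
        sns.publish(
            TopicArn="arn:aws:sns:us-east-1:123456789012:high-profile-tweets",  # placeholder
            Message=json.dumps(tweet),
        )

def append_to_timeline(follower_id: str, tweet: dict) -> None:
    ...  # write to the Timeline table / cache
```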
Advanced Features and Considerations
Though the bedrock components of a Twitter-esque system are pivotal, integrating advanced features can significantly boost user experience and overall performance.
Trending Topics and Analytics
A hallmark of platforms like Twitter is real-time trend spotting. An ever-watchful service can analyze tweets for patterns, hashtags, or mentions, displaying live trends. Combined with analytics, this offers insights into user patterns and preferences, peak tweeting times, and favoured content.
Direct Messaging
Given the inherently public nature of tweets, a direct messaging system serves as a private communication channel. This feature necessitates additional storage, retrieval mechanisms, and advanced encryption measures to preserve the sanctity of private interactions.
Push Notifications
To foster user engagement, real-time push notifications can be implemented. These alerts can inform users about new tweets, direct messages, mentions, or other salient account activities, ensuring the user stays connected and engaged.
Search Functionality
With the exponential growth in tweets and users, a sophisticated search mechanism becomes indispensable. An advanced search service, backed by technologies like Elasticsearch, can make content discovery effortless and precise.
Monetization Strategies
Integrating monetization mechanisms is paramount to ensuring the platform’s sustainability and profitability. Options include display advertisements, promoted tweets, business collaborations, and more. Striking a balance is crucial, however, so that these monetization strategies don’t intrude on the user experience.
To make a site like Twitter, you need a good system, strong safety, and features people like. Basic things like balancing traffic, organizing data, and keeping it safe are a must. But what really makes a site stand out are the new and advanced features. By thinking carefully about all these things, you can build a site that’s big and safe, but also fun and easy for people to use.