
Bulkhead Architecture Pattern: Data Security & Governance

Today during an Azure learning session focused on data security and governance, our instructor had to leave unexpectedly due to a personal emergency. Reflecting on the discussion and drawing from my background in fintech and solution architecture, I believe it would be beneficial to explore an architecture pattern relevant to our conversation: the Bulkhead Architecture Pattern.

Inspired by ship design, where a hull is divided into watertight compartments by partitions called bulkheads, the pattern ensures that a leak in one section doesn't sink the entire ship; only the affected compartment floods. Translated to software architecture, the pattern focuses on fault isolation, for example by decomposing a monolithic application into independently deployable microservices so that a failure in one service cannot cascade to the rest.

Use Case: Bank Reconciliation Reporting

Consider a scenario involving trade data across regions such as APAC, EMEA, LATAM, and NAM. Given the regulatory restrictions on cross-border data movement, proper data governance when consolidating data into a data warehouse becomes crucial. Specifically, data originating in India must not be accessible from the NAM region, and vice versa. Restricting data movement at the data-centre level is equally critical.

Microservices Isolation

  • Microservices A, B, C: Each microservice is deployed in its own Azure Kubernetes Service (AKS) cluster or Azure App Service.
  • Independent Databases: Each microservice uses a separate database instance, such as Azure SQL Database or Cosmos DB, to avoid single points of failure.

Network Isolation

  • Virtual Networks (VNets): Each microservice is deployed in its own VNet. Use Network Security Groups (NSGs) to control inbound and outbound traffic.
  • Private Endpoints: Secure access to Azure services (e.g., storage accounts, databases) using private endpoints.

Load Balancing and Traffic Management

  • Azure Front Door: Provides global load balancing and application acceleration for microservices.
  • Application Gateway: Offers application-level routing and web application firewall (WAF) capabilities.
  • Traffic Manager: A DNS-based traffic load balancer for distributing traffic across multiple regions.

Service Communication

  • Service Bus: Use Azure Service Bus for decoupled communication between microservices (see the sketch after this list).
  • Event Grid: Event-driven architecture for handling events across microservices.
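
As a minimal sketch of the Service Bus option, assuming the azure-messaging-servicebus Java SDK; the queue name trade-events and the message payload are placeholders for this example, and the connection string is read from the environment:

import com.azure.messaging.servicebus.ServiceBusClientBuilder;
import com.azure.messaging.servicebus.ServiceBusMessage;
import com.azure.messaging.servicebus.ServiceBusSenderClient;

public class TradeEventPublisher {
    public static void main(String[] args) {
        // Placeholder: supply your own namespace connection string.
        String connectionString = System.getenv("SERVICEBUS_CONNECTION_STRING");

        ServiceBusSenderClient sender = new ServiceBusClientBuilder()
                .connectionString(connectionString)
                .sender()
                .queueName("trade-events")
                .buildClient();

        // Publishing to a queue decouples the producer from downstream consumers:
        // if a consuming microservice is down, messages simply wait in the queue.
        sender.sendMessage(new ServiceBusMessage("{\"tradeId\":\"T-1001\",\"region\":\"APAC\"}"));
        sender.close();
    }
}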

Fault Isolation and Circuit Breakers

  • Polly: Implement circuit breakers and retries within microservices to handle transient faults (a Java sketch follows this list).
  • Azure Functions: Use serverless functions for non-critical, independently scalable tasks.
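
Polly itself is a .NET library; the same circuit-breaker idea can be sketched in Java with resilience4j. The service name, thresholds, and the callPricingService stub below are illustrative assumptions, not part of any real service:

import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import java.time.Duration;
import java.util.function.Supplier;

public class ResilientClient {
    public static void main(String[] args) {
        // Open the circuit when 50% of recent calls fail; stay open for 30 seconds.
        CircuitBreakerConfig config = CircuitBreakerConfig.custom()
                .failureRateThreshold(50)
                .waitDurationInOpenState(Duration.ofSeconds(30))
                .slidingWindowSize(10)
                .build();
        CircuitBreaker breaker = CircuitBreaker.of("pricing-service", config);

        // Wrap the remote call; while the circuit is open, calls fail fast
        // instead of piling up threads against a failing dependency.
        Supplier<String> guarded =
                CircuitBreaker.decorateSupplier(breaker, ResilientClient::callPricingService);
        System.out.println(guarded.get());
    }

    // Stand-in for a real remote call.
    private static String callPricingService() {
        return "price: 42.0";
    }
}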

Data Partitioning and Isolation

  • Sharding: Partition data across multiple databases to improve performance and fault tolerance (a routing sketch follows this list).
  • Data Sync: Use Azure SQL Data Sync to replicate data across regions for redundancy.
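
A simple way to route requests to the right shard is a lookup keyed by region. The hostnames below are placeholders; the point is that each region resolves to a database that physically stays in that region, which also serves the data-residency requirement from the use case:

import java.util.Map;

public class RegionShardRouter {
    // Hypothetical mapping: each region's trade data lives in a database
    // deployed inside that region's data centre.
    private static final Map<String, String> SHARD_URLS = Map.of(
            "APAC",  "jdbc:sqlserver://apac-sql.example.net;database=trades",
            "EMEA",  "jdbc:sqlserver://emea-sql.example.net;database=trades",
            "LATAM", "jdbc:sqlserver://latam-sql.example.net;database=trades",
            "NAM",   "jdbc:sqlserver://nam-sql.example.net;database=trades");

    public static String urlFor(String region) {
        String url = SHARD_URLS.get(region);
        if (url == null) {
            throw new IllegalArgumentException("No shard registered for region: " + region);
        }
        return url;
    }
}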

Monitoring and Logging

  • Azure Monitor: Centralized monitoring for performance and availability metrics.
  • Application Insights: Deep application performance monitoring and diagnostics (see the instrumentation sketch after this list).
  • Log Analytics: Aggregated logging and querying for troubleshooting and analysis.
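
As a hedged illustration of the Application Insights option, a service can emit custom telemetry with the classic Application Insights Java SDK (com.microsoft.applicationinsights:applicationinsights-core). The event and metric names here are made up for the example:

import com.microsoft.applicationinsights.TelemetryClient;

public class ReconciliationJob {
    public static void main(String[] args) {
        // Picks up the connection string / instrumentation key from the
        // environment (e.g., APPLICATIONINSIGHTS_CONNECTION_STRING).
        TelemetryClient telemetry = new TelemetryClient();

        try {
            telemetry.trackEvent("ReconciliationStarted");
            // ... reconciliation work would run here ...
            telemetry.trackMetric("RecordsReconciled", 1250);
        } catch (Exception e) {
            // Failures surface in Application Insights for diagnostics.
            telemetry.trackException(e);
            throw e;
        }
    }
}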

Advanced Threat Protection

  • Microsoft Defender for Storage (formerly Azure Defender for Storage): Enable it to detect unusual and potentially harmful attempts to access or exploit storage accounts.

Key Points

  • Isolation: Each microservice and its database are isolated in separate clusters and databases.
  • Network Security: VNets and private endpoints ensure secure communication.
  • Resilience: Circuit breakers and retries handle transient faults.
  • Monitoring: Centralized monitoring and logging for visibility and diagnostics.
  • Scalability: Each component can be independently scaled based on load.

Bulkhead Pattern Concepts

Isolation

The primary goal of the Bulkhead pattern is to isolate different parts of a system to contain failures within a specific component, preventing them from cascading and affecting the entire system. This isolation can be achieved through various means such as separate thread pools, processes, or containers.
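
As a minimal in-process sketch of the thread-pool variant, each dependency gets its own pool; the pool sizes below are arbitrary assumptions:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ThreadPoolBulkheads {
    // Each downstream dependency gets its own fixed-size pool. If the payment
    // service hangs and saturates its 10 threads, order processing still has
    // its own 5 threads and keeps working.
    private static final ExecutorService paymentPool = Executors.newFixedThreadPool(10);
    private static final ExecutorService orderPool = Executors.newFixedThreadPool(5);

    public static void main(String[] args) {
        paymentPool.submit(() -> System.out.println("processing payment"));
        orderPool.submit(() -> System.out.println("processing order"));
        paymentPool.shutdown();
        orderPool.shutdown();
    }
}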

Fault Tolerance

By containing faults within isolated compartments, the Bulkhead pattern enhances the system’s ability to tolerate failures. If one component fails, the rest of the system can continue to operate normally, thereby improving overall reliability and stability.

Resource Management

The pattern helps in managing resources efficiently by allocating specific resources (like CPU, memory, and network bandwidth) to different components. This prevents resource contention and ensures that a failure in one component does not exhaust resources needed by other components.

Implementation Examples

Kubernetes

An example of implementing the Bulkhead pattern in Kubernetes involves creating isolated pods for different services, each with its own CPU and memory requests and limits. The configuration below defines three such pods: payment-processing, order-management, and inventory-control.

apiVersion: v1
kind: Pod
metadata:
  name: payment-processing
spec:
  containers:
    - name: payment-processing-container
      image: payment-service:latest
      resources:
        requests:        # guaranteed baseline used for scheduling
          memory: "128Mi"
          cpu: "500m"
        limits:          # hard cap; the container cannot exceed these
          memory: "256Mi"
          cpu: "2"
---
apiVersion: v1
kind: Pod
metadata:
  name: order-management
spec:
  containers:
    - name: order-management-container
      image: order-service:latest
      resources:
        requests:
          memory: "64Mi"
          cpu: "250m"
        limits:
          memory: "128Mi"
          cpu: "1"
---
apiVersion: v1
kind: Pod
metadata:
  name: inventory-control
spec:
  containers:
    - name: inventory-control-container
      image: inventory-service:latest
      resources:
        requests:
          memory: "96Mi"
          cpu: "300m"
        limits:
          memory: "192Mi"
          cpu: "1.5"

In this configuration:

  • The payment-processing service is allocated 128Mi of memory and 500m of CPU as a request, with limits set to 256Mi of memory and 2 CPUs.
  • The order-management service has its own isolated resources, with 64Mi of memory and 250m of CPU as a request, and limits set to 128Mi of memory and 1 CPU.
  • The inventory-control service is given 96Mi of memory and 300m of CPU as a request, with limits set to 192Mi of memory and 1.5 CPUs.

This setup ensures that each service operates within its own resource limits, preventing any single service from exhausting resources and affecting the others.

Hystrix

Hystrix, a Netflix latency and fault-tolerance library, uses the Bulkhead pattern to limit the number of concurrent calls to a component. This is achieved through thread-pool isolation, where each component is assigned a separate thread pool, and semaphore isolation, where callers must acquire a permit before making a request. This prevents the entire system from becoming unresponsive if one component fails. (Note that Hystrix is now in maintenance mode; Netflix recommends resilience4j for new projects.)

Ref: https://github.com/Netflix/Hystrix
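
A classic Hystrix command looks like the sketch below; the group key and the pricing stub are illustrative assumptions:

import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

public class GetPriceCommand extends HystrixCommand<String> {
    public GetPriceCommand() {
        // Commands in the same group share a thread pool by default, so the
        // group acts as a bulkhead around the pricing dependency.
        super(HystrixCommandGroupKey.Factory.asKey("PricingGroup"));
    }

    @Override
    protected String run() {
        // Stand-in for a real remote call.
        return "price: 42.0";
    }

    @Override
    protected String getFallback() {
        // Returned when the call fails, times out, or the circuit is open.
        return "price: unavailable";
    }

    public static void main(String[] args) {
        System.out.println(new GetPriceCommand().execute());
    }
}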

AWS App Mesh

In AWS App Mesh, the Bulkhead pattern can be implemented at the service-mesh level. For example, in an e-commerce application with different API endpoints for reading and writing prices, resource-intensive write operations can be isolated from read operations by using separate resource pools. This prevents resource contention and ensures that read operations remain unaffected even if write operations experience a high load.

Benefits

  • Fault Containment: Isolates faults within specific components, preventing them from spreading and causing systemic failures.
  • Improved Resilience: Enhances the system’s ability to withstand unexpected failures and maintain stability.
  • Performance Optimization: Allocates resources more efficiently, avoiding bottlenecks and ensuring consistent performance.
  • Scalability: Allows independent scaling of different components based on workload demands.
  • Security Enhancement: Reduces the attack surface by isolating sensitive components, limiting the impact of security breaches.

The Bulkhead pattern is a critical design principle for constructing resilient, fault-tolerant, and efficient systems by isolating components and managing resources effectively.


Apache Hive 101: MSCK Repair Table

The MSCK REPAIR TABLE command in Hive is used to update the metadata in the Hive metastore to reflect the current state of the partitions in the file system. This is particularly necessary for external tables where partitions might be added directly to the file system (such as HDFS or Amazon S3) without using Hive commands.

What MSCK REPAIR TABLE Does

  1. Scans the File System: It scans the file system (e.g., HDFS or S3) for Hive-compatible partitions that were added after the table was created.
  2. Updates Metadata: It compares the partitions in the table metadata with those in the file system. If it finds new partitions in the file system that are not in the metadata, it adds them to the Hive metastore.
  3. Partition Detection: It detects partitions by reading the directory structure and creating partitions based on the folder names (see the example below).
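
For example, suppose an ingestion job writes a new folder directly under the table's location (a hypothetical path):

/warehouse/trades/region=APAC/dt=2024-07-22/part-00000.orc

After MSCK REPAIR TABLE trades runs, the metastore gains the partition (region='APAC', dt='2024-07-22'), and queries filtering on those columns can see the new file.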

Why MSCK REPAIR TABLE is Needed

  1. Partition Awareness: Hive stores a list of partitions for each table in its metastore. When new partitions are added directly to the file system, Hive is not aware of these partitions unless the metadata is updated. Running MSCK REPAIR TABLE ensures that the Hive metastore is synchronized with the actual data layout in the file system.
  2. Querying New Data: Without updating the metadata, queries on the table will not include the data in the new partitions. By running MSCK REPAIR TABLE, you make the new data available for querying.
  3. Automated Ingestion: For workflows that involve automated data ingestion, running MSCK REPAIR TABLE after each data load ensures that the newly ingested data is recognized by Hive without manually adding each partition.

Command to Run MSCK REPAIR TABLE

MSCK REPAIR TABLE table_name;

Replace table_name with the name of your Hive table.
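
In automated ingestion workflows, the command can also be issued programmatically. Below is a minimal sketch using the Hive JDBC driver (org.apache.hive:hive-jdbc); the HiveServer2 URL and the table name trades are placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class RepairPartitions {
    public static void main(String[] args) throws Exception {
        // Explicit driver load; harmless on modern JDBC, needed on older setups.
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Placeholder HiveServer2 endpoint -- replace with your own.
        String url = "jdbc:hive2://hive-server.example.net:10000/default";

        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement()) {
            // Run after each ingestion batch so new folders become queryable partitions.
            stmt.execute("MSCK REPAIR TABLE trades");
        }
    }
}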

Considerations and Limitations

  1. Performance: The operation can be slow, especially with a large number of partitions, as it involves scanning the entire directory structure.
  2. Incomplete Updates: If the operation times out, it may leave the table in an incomplete state where only some partitions are added. It may be necessary to run the command multiple times until all partitions are included.
  3. Add-Only Updates: By default, MSCK REPAIR TABLE only adds partitions to the metadata; it does not remove them. To remove partitions, use commands like ALTER TABLE ... DROP PARTITION (Hive 3.0+ also supports MSCK REPAIR TABLE ... SYNC PARTITIONS to drop partitions whose directories have been deleted).
  4. Hive Compatibility: Partition directories must follow Hive's key=value naming convention. For partitions that do not, manual addition using ALTER TABLE ADD PARTITION is required.