Monthly Archives: August 2023

System Design Interview: Serverless Web Crawler using AWS

Architecture Overview:

The main components of our serverless crawler are Lambda functions, an SQS queue, and a DynamoDB table. Here’s a breakdown:

  • Lambda: Two distinct functions — one for initiating the crawl and another for the actual processing.
  • SQS: Manages pending crawl tasks as a buffer and task distributor.
  • DynamoDB: Stores visited URLs, ensuring we avoid redundant visits.

Workflow & Logic Rationale:

Initiation:

Starting Point (Root URL):

Logic: The crawl starts with a root URL, e.g., “www.shanoj.com”.

Rationale: A defined beginning allows the crawler to commence in a guided manner.

Uniqueness with UUID:

Logic: A unique run ID is generated for every crawl to ensure distinction.

Rationale: This guards against potential data overlap when multiple crawls run concurrently.

Avoiding Redundant Visits:

Logic: The root URL is pre-emptively marked as “visited”.

Rationale: This step is integral to maximizing efficiency by sidestepping repeated processing.

The root URL is then sent to SQS, where it awaits crawling.
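For illustration, here is a minimal sketch of what the initiating Lambda could look like in Python with boto3. The table name, queue URL, and attribute names are assumptions for this sketch, not part of a prescribed design:

import json
import os
import uuid

import boto3

dynamodb = boto3.resource("dynamodb")
sqs = boto3.client("sqs")

# Hypothetical resource names; substitute your own.
VISITED_TABLE = os.environ.get("VISITED_TABLE", "crawler-visited")
QUEUE_URL = os.environ.get("QUEUE_URL", "https://sqs.us-east-1.amazonaws.com/123456789012/crawl-queue")

def handler(event, context):
    root_url = event.get("root_url", "https://www.shanoj.com")
    run_id = str(uuid.uuid4())  # unique ID for this crawl run

    # Pre-emptively mark the root URL as visited for this run.
    dynamodb.Table(VISITED_TABLE).put_item(Item={"run_id": run_id, "url": root_url})

    # Enqueue the root URL so the processing Lambda can pick it up.
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({"run_id": run_id, "url": root_url}),
    )
    return {"run_id": run_id}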

Processing:

Link Extraction:

Logic: A secondary Lambda function polls SQS for URLs. Once a URL is retrieved, the associated webpage is fetched, and all links within it are identified and extracted for further processing.

Rationale: Extracting all navigable paths from our current location is pivotal to web exploration.

Exploration Strategy:

Logic: Extracted links undergo a check against DynamoDB. If previously unvisited, they're marked as visited in the database and enqueued back into SQS.

Rationale: Marking a URL before enqueueing it guarantees each page is processed at most once. And because SQS delivers from a queue, the traversal is closer to breadth-first than depth-first: links discovered earlier tend to be processed earlier (a standard SQS queue does not guarantee strict ordering).

Special Considerations:

A challenge for web crawlers is the potential for link loops, which can lead to infinite cycles. By verifying the "visited" status of URLs in DynamoDB before enqueueing them, we proactively truncate these cycles.
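To make the processing side concrete, here is a sketch of the worker Lambda, reusing the hypothetical table and queue names from the initiator sketch above. The conditional write is what truncates link loops: a URL is only enqueued if it was not already marked as visited. requests and beautifulsoup4 are third-party libraries chosen for illustration:

import json

import boto3
import requests                        # third-party: pip install requests
from botocore.exceptions import ClientError
from bs4 import BeautifulSoup          # third-party: pip install beautifulsoup4

dynamodb = boto3.resource("dynamodb")
sqs = boto3.client("sqs")
table = dynamodb.Table("crawler-visited")  # hypothetical table name
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/crawl-queue"  # hypothetical

def mark_if_unvisited(run_id, url):
    """Atomically mark a URL as visited; return False if it already was."""
    try:
        table.put_item(
            Item={"run_id": run_id, "url": url},
            ConditionExpression="attribute_not_exists(#u)",
            ExpressionAttributeNames={"#u": "url"},
        )
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False
        raise

def handler(event, context):
    # Lambda receives SQS messages in batches via an event source mapping.
    for record in event["Records"]:
        message = json.loads(record["body"])
        run_id, url = message["run_id"], message["url"]

        # Fetch the page and extract every link on it.
        page = requests.get(url, timeout=10)
        soup = BeautifulSoup(page.text, "html.parser")
        for anchor in soup.find_all("a", href=True):
            link = anchor["href"]
            # Only unvisited absolute links are enqueued for further crawling.
            if link.startswith("http") and mark_if_unvisited(run_id, link):
                sqs.send_message(
                    QueueUrl=QUEUE_URL,
                    MessageBody=json.dumps({"run_id": run_id, "url": link}),
                )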

Back-of-the-Envelope Estimation for Web Crawling:

1. Data Download:

  • Webpages per month: 1 billion
  • The average size of a webpage: 500 KB

Total data downloaded per month:

1,000,000,000 (webpages) × 500 KB = 500,000,000,000 KB

or 500 TB (terabytes) of data every month.

2. Lambda Execution:

Assuming that the Lambda function needs to be invoked for each webpage to process and extract links:

  • Number of Lambda executions per month: 1 billion

(One would also need to consider the execution time of each Lambda invocation and the associated costs.)

3. DynamoDB Storage:

Let’s assume that for each webpage, we store only the URL and some metadata which might, on average, be 1 KB:

  • Total storage needed for DynamoDB per month: 1,000,000,000 (webpages) × 1 KB = 1,000,000,000 KB, or 1 TB of data storage every month.

(However, if you’re marking URLs as “visited” and removing them post the crawl, then the storage might be significantly less on a persistent basis.)

4. SQS Messages:

Each webpage URL to be crawled would be a message in SQS:

  • Number of SQS messages per month: 1 billion

The system would require:

  • 500 TB of data storage and transfer capacity for the actual web pages each month.
  • One billion Lambda function executions monthly for processing.
  • About 1 TB of DynamoDB storage, though this varies with retention and removal strategies.
  • One billion SQS messages to manage the crawl queue.
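The arithmetic is easy to sanity-check in a few lines of Python (decimal units, where 1 TB = 10**9 KB):

pages_per_month = 1_000_000_000
page_size_kb = 500
metadata_kb = 1

download_tb = pages_per_month * page_size_kb / 10**9  # 500.0 TB per month
dynamo_tb = pages_per_month * metadata_kb / 10**9     # 1.0 TB per month
print(download_tb, dynamo_tb)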


AWS-Based URL Shortener: Design, Logic, and Scalability

Here’s a behind-the-scenes look at creating a URL-shortening service using Amazon Web Services (AWS).

Users and System Interaction:

  • User Requests: Users submit a long web address to get a shorter version, follow a short link to reach the original website, or delete a short link.
  • API Gateway: This is AWS’s reception. It directs user requests to the right service inside AWS.
  • Lambda Functions: These are the workers. They perform tasks like making a link shorter, retrieving the original from a short link, or deleting a short link.
  • DynamoDB: This is the storage room. All the long and short web addresses are stored here.
  • ElastiCache: Before heading to DynamoDB, the system checks here first when users access a short link. It’s faster.
  • VPC & Subnets: This is the AWS structure. The welcoming part (API Gateway) is public, while sensitive data (DynamoDB) is kept private and secure.

Making Links Shorter for Users:

  • Sequential Counting: Every web link gets a unique number. To keep it short, that number is converted into a combination of letters and numbers.
  • Hashing: The system also shortens the long web address into a fixed-length string. This method can produce the same output for different links (a collision), but the system manages and differentiates such cases.

Sequential Counting: This takes a long URL as input and uses a unique counter value from the database to generate a short URL.

For instance, the URL https://example.com/very-long-url might be shortened to https://short.url/1234AB by taking a unique number from the database and converting it into a mix of letters and numbers.

Hashing: This involves taking a long URL and converting it to a fixed-size string of characters using a hashing algorithm. So, https://example.com/very-long-url could become https://short.url/h5Gk9.
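Both techniques fit in a few lines of Python. This is a minimal sketch: the 62-character alphabet, the 6-character hash prefix, and the example URL are illustrative assumptions:

import hashlib
import string

# 62-character alphabet: digits + lowercase + uppercase letters.
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase

def encode_base62(n):
    """Sequential counting: convert a unique counter value into a short code."""
    if n == 0:
        return ALPHABET[0]
    code = []
    while n:
        n, rem = divmod(n, 62)
        code.append(ALPHABET[rem])
    return "".join(reversed(code))

def hash_code(long_url, length=6):
    """Hashing: derive a fixed-length code from the URL itself."""
    return hashlib.sha256(long_url.encode()).hexdigest()[:length]

print(encode_base62(1234))                             # 'jU'
print(hash_code("https://example.com/very-long-url"))  # a fixed 6-character code
print(62 ** 6)                                         # 56800235584, ~56.8 billion codes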

The rationale for combining the two:

  1. Enhanced Uniqueness & Collision Handling: The sequential counter guarantees uniqueness, so in the event of a hashing collision, the sequential identifier can serve as a fallback or be combined with the hash.
  2. Balancing Predictability & Compactness: Purely sequential codes are compact but easy to enumerate; mixing in a hash component makes short URLs harder to guess.
  3. Scalability & Performance: Lookups by a unique sequential key are fast and collision-free, so retrievals stay quick even as the data set grows.

Lambda Function for Shortening (PUT Request)

  1. Input: Long URL, e.g. "https://www.example.com/very-long-url".
  2. URL exists check: If the URL has been shortened before, return the stored short URL, e.g. "abcd12".
  3. Hash URL: Output, e.g. "a1b2c3".
  4. Assign number: Unique sequential number, e.g. "456".
  5. Combine hash & number: e.g. "a1b2c3456".
  6. Store in DynamoDB: {"https://www.example.com/very-long-url": "a1b2c3456"}.
  7. Update ElastiCache: {"a1b2c3456": "https://www.example.com/very-long-url"}.
  8. Return to API Gateway: Shortened URL, e.g. "a1b2c3456".
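Stitched together, the PUT flow might look roughly like the sketch below. The single-table key scheme (pk values prefixed with URL# and CODE#), the table name, and the cache endpoint are all hypothetical, and error handling is omitted:

import hashlib

import boto3
import redis  # third-party client for the ElastiCache (Redis) endpoint

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("url-mappings")                       # hypothetical table
cache = redis.Redis(host="my-cache.example.com", port=6379)  # hypothetical endpoint

def shorten(long_url):
    # Steps 1-2: if this URL was shortened before, return the stored code.
    existing = table.get_item(Key={"pk": f"URL#{long_url}"}).get("Item")
    if existing:
        return existing["short_code"]

    # Step 3: hash the URL to a fixed-length prefix.
    prefix = hashlib.sha256(long_url.encode()).hexdigest()[:6]

    # Step 4: draw a unique sequential number from an atomic counter item.
    resp = table.update_item(
        Key={"pk": "COUNTER"},
        UpdateExpression="ADD seq :one",
        ExpressionAttributeValues={":one": 1},
        ReturnValues="UPDATED_NEW",
    )
    number = int(resp["Attributes"]["seq"])

    # Step 5: combine hash and number into the short code.
    short_code = f"{prefix}{number}"

    # Step 6: store both directions of the mapping in DynamoDB.
    table.put_item(Item={"pk": f"URL#{long_url}", "short_code": short_code})
    table.put_item(Item={"pk": f"CODE#{short_code}", "long_url": long_url})

    # Step 7: warm the cache for fast redirects.
    cache.set(short_code, long_url)

    # Step 8: return the short code to API Gateway.
    return short_code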

Lambda Function for Redirecting (GET Request)

  • Input: The user provides a short URL like "a1b2c3456".
  • Check in ElastiCache: The system looks up the short URL in ElastiCache.
  • Cache hit: If the long URL is found in the cache, the system returns it directly.
  • Cache miss: If it is not in the cache, the system falls back to DynamoDB.
  • Check in DynamoDB: The system searches DynamoDB for the corresponding long URL.
  • URL found: The long URL matching the given short URL is retrieved, e.g. "https://www.example.com/very-long-url".
  • Update ElastiCache: The system updates the cache with {"a1b2c3456": "https://www.example.com/very-long-url"}.
  • Return to API Gateway: The system redirects the user to the original long URL.
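The read path is a classic cache-aside lookup. A sketch under the same assumptions as the PUT flow above:

import boto3
import redis

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("url-mappings")                       # hypothetical table
cache = redis.Redis(host="my-cache.example.com", port=6379)  # hypothetical endpoint

def resolve(short_code):
    # Cache hit: return straight from ElastiCache.
    cached = cache.get(short_code)
    if cached:
        return cached.decode()

    # Cache miss: fall back to DynamoDB.
    item = table.get_item(Key={"pk": f"CODE#{short_code}"}).get("Item")
    if item is None:
        return None  # unknown short URL

    # Populate the cache so the next lookup is a hit, then return.
    long_url = item["long_url"]
    cache.set(short_code, long_url)
    return long_url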

Lambda Function for Deleting (DELETE Request)

  • Input: The user provides the short URL they want to delete.
  • Check in DynamoDB: The system looks up the short URL in DynamoDB.
  • URL found: If a mapping for the short URL exists, the system proceeds to deletion.
  • Delete from DynamoDB: The system deletes the URL mapping from DynamoDB.
  • Clear from ElastiCache: The system also clears the mapping from the cache so the short URL no longer redirects users.
  • Return confirmation to API Gateway: After successful deletion, a confirmation is sent back through API Gateway to the user.
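Deletion follows the same pattern. Reusing the table and cache handles defined in the sketches above, it removes both directions of the mapping and invalidates the cache:

def delete(short_code):
    item = table.get_item(Key={"pk": f"CODE#{short_code}"}).get("Item")
    if item is None:
        return False  # nothing to delete

    # Remove both directions of the mapping, then invalidate the cache
    # so the short URL stops redirecting immediately.
    table.delete_item(Key={"pk": f"URL#{item['long_url']}"})
    table.delete_item(Key={"pk": f"CODE#{short_code}"})
    cache.delete(short_code)
    return True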

Simple Math Behind Our URL Shortening (Envelope Estimation):

When we use a 6-character mix of letters (both lowercase and uppercase) and numbers for our short URLs, we have 62^6 ≈ 56.8 billion different combinations. If users create 100 million short links every day, we can keep generating unique links for over 500 days without repeating one.


Angular & Microfrontends: Toy Blocks to Web Blocks

When I was a child, my playtime revolved around building vibrant cities with my toy blocks. I would carefully piece them together, ensuring each block had its own space and significance. As a seasoned architect with over two decades of industry experience, I’ve transitioned from tangible to digital blocks. The essence remains unchanged: creating structured and efficient designs.

Microfrontends:

Much like the city sectors of my childhood imaginations, microfrontends offer modularity, allowing different parts of a web application to evolve independently yet harmoniously. Angular’s intrinsic modular nature seamlessly aligns with this. This modular structure can be imagined as various sectors or boroughs of a digital city, each having its unique essence yet forming part of the larger metropolis.

AG Grid:

In my toy block city, streets and avenues ensured connectivity. AG Grid performs a similar function in our digital city, giving structure and clarity to vast amounts of data. With Angular, integrating AG Grid feels as natural as laying down roads on a plain.

<ag-grid-angular
style="width: 100%; height: 500px;"
class="ag-theme-alpine"
[rowData]="myData"
[columnDefs]="myColumns">
</ag-grid-angular>

These grids act as pathways, guiding the user through the information landscape.

Web Components and Angular Elements:

In the heart of my miniature city, unique buildings stood tall, each with its distinct architecture. Web components in our digital city reflect this individuality. They encapsulate functionality and can be reused across applications, making them the skyscrapers of our application. With Angular Elements, creating these standalone skyscrapers becomes a breeze.

import { Injector, NgModule } from '@angular/core';
import { createCustomElement } from '@angular/elements';

import { DashboardComponent } from './dashboard.component';

@NgModule({
  declarations: [DashboardComponent]
})
export class DashboardModule {
  constructor(injector: Injector) {
    // Wrap the Angular component as a standard custom element
    const customElement = createCustomElement(DashboardComponent, { injector });
    // Register it so <my-dashboard> can be used anywhere in the page
    customElements.define('my-dashboard', customElement);
  }
}

Webpack and Infrastructure:

Beneath my toy city lay an imaginary network of tunnels and infrastructure. Similarly, Webpack operates behind the scenes in our digital realm, ensuring our Angular applications are optimized and efficiently bundled.

// AngularWebpackPlugin is a named export of @ngtools/webpack
const { AngularWebpackPlugin } = require('@ngtools/webpack');

module.exports = {
  // ...
  module: {
    rules: [
      {
        test: /(?:\.ngfactory\.js|\.ngstyle\.js|\.ts)$/,
        loader: '@ngtools/webpack'
      }
    ]
  },
  plugins: [
    new AngularWebpackPlugin()
  ]
};

Manfred Steyer:

In every narrative, there’s an inspiration. For me, that beacon has been Manfred Steyer. His contributions to the Angular community have been invaluable. His insights into microfrontends and architecture greatly inspired my journey. Manfred’s eBook (https://www.angulararchitects.io/en/book/) is a must-read for those yearning to deepen their understanding.

From the joys of childhood toy blocks to the complex software architectures of today, the essence of creation is unchanging. Tools like Module Federation, Angular, Webpack, AG Grid, and Web Components, combined with foundational structures like the shell application, empower us not just to build but to envision and innovate.


REST vs. GraphQL: Tale of Two Hotel Waiters

Imagine visiting a grand hotel with two renowned restaurants. In the first restaurant, the waiter, named REST, serves a fixed menu. You get a three-course meal whether you’re hungry for all of it or not. In the second restaurant, the waiter, GraphQL, takes custom orders. You specify whether you want just an appetizer or the whole deal, and GraphQL brings exactly that.

The Role of Waiters (APIs)

Both REST and GraphQL, like our waiters, serve as intermediaries. They’re like hotel waiters fetching what you, the diner (or in tech terms, the user), ask for from the kitchen (the server or database). It’s how apps and websites get the data they need.

Meet Waiter REST

REST, the waiter from the first restaurant, is efficient and follows a set protocol. When you sit at his table, he serves using distinct methods (like GET or POST). REST ensures you get the full experience of the hotel's menu but might serve more than your appetite demands.

Introducing Waiter GraphQL

GraphQL, on the other hand, listens intently to your cravings. He allows you to specify exactly what you’re hungry for using a ‘schema’ — a menu that outlines what dishes are available. If you fancy a dish that needs ingredients from different parts of the kitchen, GraphQL brings it all together in one well-presented plate.

Shared Service Traits

  1. Both waiters ensure a memorable dining experience, enabling apps and websites to fetch data.
  2. They have standardized methods, simplifying the ordering process.
  3. Both serve their dishes (or data) in a universally appealing manner, often using formats like JSON.

Distinguishing Their Service

  1. Volume of Dishes: REST serves the entire menu, while GraphQL offers customized options based on your preferences.
  2. Efficiency: REST might need multiple trips to the kitchen for various courses. GraphQL, however, gathers everything you need in one trip.
  3. Familiarity: REST, having served in the industry for longer, is a familiar face to many. GraphQL, the newer waiter, might need some introduction.

Choosing Your Dining Experience

  • REST is great for a comprehensive experience. If you’re not sure what you want and wish to try everything, REST ensures you don’t miss out.
  • GraphQL is perfect for a tailored experience. If you know your cravings and desire a specific mix of dishes, GraphQL is your go-to.

Interestingly, many modern hotels (or tech platforms) employ both waiters, ensuring guests always have the dining experience they prefer.

Chef REST’s Dishes

With REST, if you order the “Spaghetti” dish, Chef REST provides you with his classic spaghetti, meatballs, and a side of garlic bread, even if you only wanted spaghetti and meatballs.

REST Request: (Ordering Spaghetti)

import requests

response = requests.get('https://api.hotelmenu.com/dishes/spaghetti')
dish_details = response.json()

print(dish_details)

# The server might respond with:
# {
#   "dish": "Spaghetti",
#   "ingredients": ["spaghetti", "meatballs", "garlic bread"]
# }

Chef GraphQL’s Custom Dishes

With Chef GraphQL, if you only want spaghetti and meatballs without the garlic bread, you specify those ingredients in your order.

GraphQL Query: (Ordering customized Spaghetti)

import requests

url = 'https://api.hotelmenu.com/graphql'
headers = {'Content-Type': 'application/json'}
query = {
    "query": """
    {
      dish(name: "Spaghetti") {
        ingredients(includes: ["spaghetti", "meatballs"])
      }
    }
    """
}

response = requests.post(url, json=query, headers=headers)
custom_dish = response.json()

print(custom_dish)

# The server might respond with:
# {
#   "data": {
#     "dish": {
#       "ingredients": ["spaghetti", "meatballs"]
#     }
#   }
# }

Now, with these Python examples, you can directly see how our two waiters, REST and GraphQL, serve data in the tech realm.


API-First Software Development: A Paradigm Shift for Modern Organizations

In the fast-paced world of software development, organizations are constantly seeking innovative approaches to enhance their agility, scalability, and interoperability. One such approach that has gained significant attention is API-first software development. Recently, I stumbled upon an enlightening article by Joyce Lin titled “API-First Software Development for Modern Organizations,” which struck a chord with my perception of this transformative methodology.

API-first development prioritizes APIs in software design to create strong, interconnected systems. It's a game-changer for modern organizations, and Lin explains the principles well.

The concept of separation of concerns particularly resonated with me. By decoupling backend services and frontend/client applications, API-first development enables teams to work independently and in parallel. This separation liberates developers to focus on their specific areas of expertise, allowing for faster development cycles and empowering collaboration across teams. The API acts as the bridge, the bond that seamlessly connects these disparate components into a cohesive whole.

Moreover, Lin emphasizes the scalability and reusability inherent in API-first development. APIs inherently promote modularity, providing clear boundaries and well-defined contracts. This modularity not only facilitates code reuse within a project but also fosters reusability across different projects or even beyond organizational boundaries. It’s a concept that aligns perfectly with my belief in the power of building on solid foundations and maximizing efficiency through code reuse.

Another crucial aspect Lin highlights is the flexibility and innovation that API-first development brings to the table. By designing APIs as the primary concern, organizations open the doors to experimentation, enabling teams to explore new technologies, frameworks, and languages on either side of the API spectrum. This adaptability empowers modern organizations to stay at the forefront of technological advancements and fuel their drive for continuous innovation.

After reading Lin’s article, I firmly believe that API-first development is not just a passing trend but a revolutionary approach that unleashes the full potential of modern organizations. The importance of API-first design, teamwork, flexibility, and compatibility aligns with my personal experiences and goals. This methodology drives organizations towards increased agility, scalability, and efficiency, empowering them to succeed in the constantly changing digital world.

Thank you, Joyce Lin, for your insightful article on API-First Software Development for Modern Organizations.


Managing Tech Debt: Balancing Speed & Quality

When a team discovers technical debt, there are several possible approaches to consider, and choosing among them comes down to weighing speed against quality.

To effectively manage technical debt, it is crucial to strike a balance between speed and quality. This involves allocating sufficient time for proper planning, design, and testing of software, ensuring its long-term maintainability and scalability.

If you’d like to explore this topic further, the following resources can provide more insights:

Book Recommendation: “Thinking in Systems” by Donella H. Meadows

“Thinking in Systems” by Donella H. Meadows teaches readers how to develop essential skills for solving problems of all sizes. It emphasizes the importance of a holistic approach and nurturing positive outcomes, empowering readers to find proactive and effective solutions.

  • Meadows brings systems thinking into the tangible world, providing problem-solving skills on various scales.
  • The book emphasizes that global challenges like war, hunger, poverty, and environmental degradation are system failures.
  • Fixing isolated pieces is insufficient; interconnectedness and interdependence must be recognized.
  • “Thinking in Systems” equips readers with the tools and mindset to navigate confusion and helplessness.
  • Embracing systems thinking leads to proactive and effective approaches.

Recommended for those seeking a holistic understanding and aiming for meaningful change.

System Design Interview — An Insider’s Guide: Volumes 1 & 2

Throughout the years, I have committed myself to continuously improving my skills in system design. My drive to pursue further knowledge and resources didn’t stem from seeking external validation or a new job opportunity. Instead, I sought to elevate my current role and excel in it. One of my go-to resources in this journey has been Alex Xu’s book, which has become a reliable companion. Every time I revisit it, I am reminded of crucial concepts and invigorated in my approach to problem-solving:

System Design Interview — An Insider’s Guide (Volume 1):

  • Solutions to 16 real system design scenarios, offering practical guidance for enterprise architects to enhance their problem-solving skills.

The book covers diverse topics, from scaling user traffic to designing complex systems like chat systems and search autocomplete systems.

System Design Interview — An Insider’s Guide (Volume 2):

  • A four-step framework serving as a systematic approach to system design interviews.
  • Detailed solutions to 13 real system design interview questions.
  • Over 300 diagrams offer visual explanations of various systems.

The book covers topics like proximity services, distributed message queues, and real-time gaming leaderboards, among others. It caters to readers who possess a basic understanding of distributed systems.


Overcoming Limitations: Creating Custom Views in Trino Connectors (JDBC) without Native Support

During a feasibility test using distributed SQL (Trino/Starburst) to handle high-volume ad hoc SQL queries, a common challenge arose. Trino, an open-source distributed SQL query engine, supports various connectors for interacting with different data sources. However, we discovered that creating views/tables on Oracle-based connectors is not directly supported by Trino. In this article, we will explore a way to overcome this limitation by leveraging a dummy connector to create custom views in Trino.

Solution Steps:

  • Create a dummy connector:

To enable the creation of custom views in Trino, we need to set up a dummy connector. This connector will serve as a catalog for storing the custom views and tables.

Create a new file named dummy.properties in Trino's catalog directory and add the following content:

connector.name=memory

  • Restart Trino:

Restart the Trino server to apply the configuration change and make the dummy connector available.

  • Verify and select the catalog:

Check the available catalogs using the following command:

trino> show catalogs;
 Catalog
---------
 dummy
 jmx
 memory
 oracle
 system
 tpcds
 tpch
(7 rows)

trino> use dummy.default;
USE

  • Create custom views:
  • Create Custom Views:

Now that the dummy connector is set up and selected, we can create custom views using SQL statements. Let's assume we want to create custom views based on tables from the oracle.hr schema. Note that oracle is the catalog name for the Oracle connector in this example.

-- Create custom view
CREATE VIEW cust_emp_v AS SELECT * FROM oracle.hr.emp;
CREATE VIEW cust_dept_v AS SELECT * FROM oracle.hr.dept;
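With the views in place, a single query can join data that lives behind different connectors. A quick illustration, where the ename, dname, and deptno columns follow the classic Oracle HR sample tables and are assumptions here:

-- Hypothetical cross-connector join through the custom views
SELECT e.ename, d.dname
FROM cust_emp_v e
JOIN cust_dept_v d ON e.deptno = d.deptno;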

This solution lets us create tables and views in Trino and perform complex analytics that seamlessly join data from multiple connectors. By sharing this article, I hope to help others who face similar challenges when working with Trino and Oracle databases.

The Rise of Analytical Engineering: Embracing a Data-Driven Future

I wanted to share my thoughts on an exciting trend that I believe is reshaping the data landscape: analytical engineering. As someone who has personally experienced this shift, I can confidently say that it holds immense potential and opens up thrilling opportunities.

Analytical engineering is at the forefront of the data analytics field, bridging the gap between traditional data engineering and advanced analytics. By combining the best of both worlds, it empowers organizations to uncover deeper insights and make informed, data-driven decisions.

What truly sets analytical engineering apart is its ability to connect data teams with business stakeholders. No longer confined to isolated data operations, analytical engineers actively participate in strategic discussions, contribute to shaping priorities, and align data initiatives with business objectives. This collaboration is a game-changer, driving tangible value and fueling business growth.

At the core of analytical engineering lies the power of SQL and data modelling. These skills enable analytical engineers to transform and analyze data, creating robust data models that generate accurate and actionable insights. By leveraging modern data stack tools like dbt, analytical engineers streamline the data pipeline, ensuring seamless data ingestion, transformation, and scheduling.

Another critical aspect of analytical engineering is the empowerment of self-service analytics. By providing intuitive tools and platforms, analytical engineers enable business users to explore and analyze data independently. This democratization of data fosters a culture of data-driven decision-making, empowering individuals at all levels to unlock valuable insights without relying solely on technical teams.

The demand for analytical engineering skills is skyrocketing as businesses increasingly recognize the competitive advantage of advanced analytics. Roles like analytics engineer offer professionals a unique opportunity to leverage their technical expertise while driving impactful business outcomes. It’s an exciting time to be part of this field, with competitive salaries and ample room for career growth.

As an Enterprise Solution Architect, I have personally witnessed the transformative power of analytical engineering. It is an exciting career path that merges technical excellence with business acumen, enabling professionals to shape priorities, drive innovation, and significantly impact organizational success. While analytical engineering takes the spotlight, it is important to acknowledge the continued importance of data engineering, as the two disciplines complement each other.
