Capacity Plan

Tip

  1. KB → MB = KB/10^3
  2. MB → GB = MB/10^3
  3. GB → TB = GB/10^3
  4. MB → TB = MB/10^6

Capacity planning is a critical aspect of system design. It involves estimating the maximum load that a system can handle, including the number of users, transactions, or data that the system can process without performance degradation. Effective capacity planning can help avoid bottlenecks, ensure system scalability, and prevent costly resource over-allocation. Capacity planning is an iterative process. As the system grows and evolves, the capacity requirements may change, and the capacity planning should be updated accordingly.

Key Steps

Here are some key steps involved in capacity planning:

Estimate Options

Start by estimating the scope of the system. Your initial measure for quantifying capacity is usually an assumed throughput, inferred from a similar feature or another number such as daily active users. Then, come up with a ballpark estimate of how much a request costs in bytes. For object storage, consider factors like versioning and the average size of different media files. Once you know how much each record costs, think of bandwidth per operation.

Peak QPS

Calculating peak Queries Per Second (QPS) is important because it often dictates the capacity requirement of the design. Peak QPS is the highest rate of queries the system is expected to handle, typically during periods of high usage or traffic spikes. It can be estimated from historical data, or by predicting event-driven or time-driven peaks.
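When no historical data is available, one common back-of-the-envelope shortcut is to scale the average rate by an assumed peak-to-average multiplier. The sketch below illustrates that shortcut only; the function name `peak_qps` and the default factor of 2 are assumptions made for illustration, not measured values.

```python
def peak_qps(average_qps: float, peak_factor: float = 2.0) -> float:
    """Estimate peak QPS by scaling the average rate.

    peak_factor is an ASSUMED peak-to-average multiplier; replace it with a
    value derived from real traffic data whenever that data exists.
    """
    if average_qps < 0:
        raise ValueError("average_qps must be non-negative")
    if peak_factor < 1:
        raise ValueError("peak_factor must be at least 1")
    return average_qps * peak_factor


# An average of 50 QPS with the assumed factor of 2.
print(peak_qps(50))  # 100.0
```

If historical traffic exists, take the observed maximum (or a high percentile) instead of a guessed multiplier.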

Understanding Request Sizes

Assessing request sizes is crucial for determining bandwidth and storage requirements. Depending on the system's functionality and the nature of data it handles, you can make informed assumptions about the size of different types of requests.

Throughput Calculation

Throughput is an important metric for capacity planning. It represents the number of requests that can be handled by the system in a given time frame. Throughput can be calculated based on the system's requirements or inferred from relevant metrics like daily active users.

Estimating Server Requirement

With the estimated throughput and response time, you can estimate the number of servers needed to run the application. This calculation takes into account the average response time per request and the number of workers each server can handle.
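A minimal sketch of this estimate, assuming each worker handles one request at a time and using Little's law (requests in flight = arrival rate * time per request). The function name and the sample figures (200 ms response time, 4 workers per server) are hypothetical.

```python
import math


def servers_needed(throughput_rps: float,
                   avg_response_time_ms: float,
                   workers_per_server: int) -> int:
    """Estimate the server count from throughput and response time.

    Assumes one in-flight request per worker and no headroom for spikes.
    """
    # Little's law: average number of requests being processed at once.
    concurrent_requests = throughput_rps * avg_response_time_ms / 1000
    # Round up, and always provision at least one server.
    return max(1, math.ceil(concurrent_requests / workers_per_server))


# 50 RPS at 200 ms each means 10 requests in flight; 4 workers per server.
print(servers_needed(50, 200, 4))  # 3
```

In practice you would add headroom on top of this figure for peaks and failures.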

Bandwidth & Data in Transit

Understanding each record’s cost is fundamental. You need to consider the bandwidth per operation in both directions - client to server (ingress) and server to client (egress).

Be Prepared to Operate

Much of the truly challenging work in capacity planning lies in operating that capacity. Assumptions made during the design stage may not hold true in practice, so continuous planning and monitoring are extremely important. Encoding some of the design-stage assumptions as alerts can help detect misconceptions early. Concepts from threat modeling can also help you understand which safeguards are missing and what happens if a certain characteristic or a non-functional requirement changes.

Rounding and Approximation

It is difficult to perform complicated math operations during the interview. There is no need to spend valuable time to solve complicated math problems. Precision is not expected. Use round numbers and approximation to your advantage.

System's Metadata to Consider

When designing a system, metadata about its usage provides insight into its behavior and performance, and informs decisions about architecture, scaling, and optimization, such as load balancing, data partitioning, and resource allocation. Monitor and analyze this metadata regularly so that those decisions stay grounded in how the system actually behaves.

Here are some common types of metadata used in system design:

Daily Active Users (DAU)

This refers to the number of unique users who engage with the system on a given day. It's a key metric for understanding user activity and growth. High DAU indicates a large user base and active engagement, while low DAU suggests fewer users or less engagement.

Average Requests by User

This refers to the average number of requests each user makes, typically measured per day. It can provide insights into user behavior and usage patterns.

Average Request Data Size

This refers to the size of the data included in each request. It's important for understanding the volume of data being processed and can impact system performance and scalability. Larger request sizes can increase the load on the system and slow down response times.

Read vs Write

This refers to the ratio of read operations to write operations in the system. The ratio can provide insights into the system's usage patterns and can impact system performance and scalability. For example, a higher read-to-write ratio might suggest a system that primarily retrieves data rather than updates it.

Calculate Requests per Second (RPS)

Calculating the number of requests per second (RPS) is a critical part of system design and performance estimation. It tells you how many requests your system must handle each second.

Considering:

  1. 1 million DAU (daily active users)
  2. 5 requests per user per day, on average
  3. 100,000 seconds per day, or 10^5

Seconds per Day Calculation

Strictly, one day has 86,400 seconds (24 * 60 * 60). However, as noted under Rounding and Approximation in Key Steps, it is normal to round to simplify the calculation, so it is fine to say that one day has 100,000 seconds, or 10^5. Keep in mind that this rounding makes per-second rates come out about 14% lower than the exact figure.

Calculating:

  1. 1,000,000 * 5 = 5,000,000 requests per day
  2. 5,000,000 / 100,000 = 50 requests per second (RPS)

OR

  1. 5 * 10^6 / 10^5 = 5 * 10^(6-5)
  2. 5 * 10^1 = 50 RPS
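The calculation above can be sketched in a few lines of Python. The function and constant names are illustrative; the second call uses the exact 86,400 seconds per day to show how much the rounding changes the result.

```python
SECONDS_PER_DAY_ROUNDED = 10**5        # rounded value used in the estimate
SECONDS_PER_DAY_EXACT = 24 * 60 * 60   # 86,400


def average_rps(dau: int,
                requests_per_user_per_day: float,
                seconds_per_day: int = SECONDS_PER_DAY_ROUNDED) -> float:
    """Average requests per second from DAU and per-user daily requests."""
    return dau * requests_per_user_per_day / seconds_per_day


print(average_rps(1_000_000, 5))                                    # 50.0
print(round(average_rps(1_000_000, 5, SECONDS_PER_DAY_EXACT), 1))   # 57.9
```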

Write and Read

Considering:

  1. Read vs. write ratio = 9:1

Calculate:

  1. Writes = 50 RPS * 0.1 = 5 RPS
  2. Reads = 50 RPS * 0.9 = 45 RPS
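As a small sketch, the split can be computed directly from the ratio parts, which avoids converting 9:1 into decimal fractions by hand. The function name `split_read_write` is hypothetical.

```python
def split_read_write(total_rps: float,
                     read_parts: int,
                     write_parts: int) -> tuple[float, float]:
    """Split a total request rate by a read:write ratio such as 9:1."""
    total_parts = read_parts + write_parts
    reads = total_rps * read_parts / total_parts
    writes = total_rps * write_parts / total_parts
    return reads, writes


reads, writes = split_read_write(50, read_parts=9, write_parts=1)
print(reads, writes)  # 45.0 5.0
```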

Purchases per Second

Considering:

  1. 5% of users buy something each day

Calculate:

  1. 5% of 1 million = 50,000 purchases per day, or 5% of 10^6 = 5 * 10^4
  2. 5 * 10^4 / 10^5 (seconds per day) = 5 * 10^(4-5) = 5 * 10^-1
  3. 0.5 purchases per second
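The same arithmetic as a short sketch, assuming one purchase per buying user per day and the rounded 10^5 seconds per day. The function name is illustrative.

```python
SECONDS_PER_DAY = 10**5  # rounded, as in the estimate above


def purchases_per_second(dau: int,
                         buyer_percent: float,
                         seconds_per_day: int = SECONDS_PER_DAY) -> float:
    """Average purchases per second, assuming one purchase per buyer per day."""
    purchases_per_day = dau * buyer_percent / 100
    return purchases_per_day / seconds_per_day


print(purchases_per_second(1_000_000, 5))  # 0.5
```

A rate below 1 per second simply means one purchase every couple of seconds on average.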

Calculate Peak Queries Per Second (QPS)

Bandwidth

Bandwidth refers to the maximum rate of data transfer across a given path. It's typically measured in bits per second (bps), kilobits per second (Kbps), megabits per second (Mbps), gigabits per second (Gbps), or terabits per second (Tbps). Bandwidth is a crucial factor in system design as it directly influences the system's performance and scalability.

Steps to Calculate

  1. Identify the Data Transfer Rate: This is the speed at which data is transferred over the network. It's usually provided by the network service provider and is typically measured in bits per second (bps). For example, a 1 Gbps network has a data transfer rate of 1,000,000,000 bits per second.
  2. Calculate the Number of Users: This is the number of users that will be using the network simultaneously. For example, if you have a network with a data transfer rate of 1 Gbps and you have 100 users, each user will have a bandwidth of 10 Mbps (1 Gbps / 100 users).
  3. Calculate the Bandwidth Usage Per User: This is the amount of data that each user will be transferring over the network. For example, if each user is transferring 10 Mb (megabits) of data per second, the bandwidth usage per user will be 10 Mbps. Mind the unit: 10 MB (megabytes) per second would be 80 Mbps, since 1 byte = 8 bits.
  4. Calculate the Total Bandwidth Usage: This is the total amount of data that will be transferred over the network. It's calculated by multiplying the bandwidth usage per user by the number of users. For example, if each user is using 10 Mbps and there are 100 users, the total bandwidth usage will be 1 Gbps.

First Example

Suppose you have a network with a data transfer rate of 1 Gbps (1,000,000,000 bits per second) and 100 users, each transferring 10 Mb (megabits) of data per second. The total bandwidth usage is 10 Mbps * 100 users = 1 Gbps, which saturates the link.
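A minimal sketch of the arithmetic in this section, assuming decimal units (1 Gbps = 1,000 Mbps). It also includes a bytes-to-bits helper, because link speeds are quoted in bits while payload sizes are often quoted in bytes. All function names are illustrative.

```python
BITS_PER_BYTE = 8


def megabytes_to_megabits(megabytes: float) -> float:
    """Convert MB to Mb (1 byte = 8 bits)."""
    return megabytes * BITS_PER_BYTE


def per_user_share_mbps(link_mbps: float, users: int) -> float:
    """Even share of the link available to each simultaneous user."""
    return link_mbps / users


def total_usage_mbps(per_user_mbps: float, users: int) -> float:
    """Total demand when every user transfers at per_user_mbps."""
    return per_user_mbps * users


link_mbps = 1_000  # a 1 Gbps link expressed in Mbps
users = 100
print(per_user_share_mbps(link_mbps, users))  # 10.0
print(total_usage_mbps(10, users))            # 1000
print(megabytes_to_megabits(10))              # 80
```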

Second Example

Considering:

  1. 50 requests per second (RPS)
  2. Each request is 50 KB

Calculate:

  1. 50 RPS * 50 KB = 2,500 KB/s, or
  2. 2,500 KB/s / 1,000 = 2.5 MB/s, or
  3. 2.5 * 10^3 / 10^3 = 2.5 * 10^(3-3) = 2.5 MB/s
  4. In bits: 2.5 MB/s * 8 = 20 Mbps
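The same estimate as a short sketch, assuming the request size is given in kilobytes and using decimal units (1 MB = 1,000 KB). The function names are hypothetical.

```python
def bandwidth_mb_per_s(rps: float, request_size_kb: float) -> float:
    """Data rate in megabytes per second (decimal units)."""
    return rps * request_size_kb / 1000


def to_mbps(mb_per_s: float) -> float:
    """Convert megabytes per second to megabits per second."""
    return mb_per_s * 8


rate = bandwidth_mb_per_s(50, 50)
print(rate)           # 2.5
print(to_mbps(rate))  # 20.0
```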

Storage

Storage in system design refers to the amount of data that the system will need to store over time. It's crucial for understanding the system's scalability and performance, and it often plays a key role in the system's cost and infrastructure requirements.

Steps to Calculate

  1. Identify the Amount of Data Each User Will Generate: This is the amount of data each user will generate and store in the system. For example, each user generates 1 MB of data per day.
  2. Estimate the Number of Users: This is the number of users that will be using the system. For example, a social media platform with 100 million daily active users (DAU).
  3. Calculate the Total Data Generation Per Day: This is the total amount of data that will be generated and stored in the system each day. It's calculated by multiplying the amount of data each user will generate by the number of users. For example, if each user generates 1 MB of data per day and there are 100 million users, then the total data generation per day will be 100,000,000 MB, or 100 TB.
  4. Estimate the Retention Period: This is the length of time that the system will retain user data. For example, the system retains user data for 3 years.
  5. Calculate the Total Storage Requirement: This is the total amount of storage that the system will need to store user data over the entire retention period. It's calculated by multiplying the total data generation per day by the number of days in the retention period. For example, if the system generates 100 TB of data per day and retains data for 3 years (approximately 1,095 days), then the total storage requirement will be 109,500 TB, or about 109.5 PB.
  6. Apply the Replication Factor: The replication factor (RF) is the number of nodes on which each piece of data (rows and partitions) is stored. Multiply the total storage requirement by the RF to get the raw capacity needed.
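The steps above can be sketched as a single function. The inputs in the usage example (1 million users, 2 MB per user per day, 365 days, RF 3) are hypothetical values chosen only to keep the arithmetic easy to follow, and the function name is illustrative.

```python
MB_PER_TB = 10**6  # decimal units: MB -> TB = MB / 10^6


def storage_requirement_tb(users: int,
                           mb_per_user_per_day: float,
                           retention_days: int,
                           replication_factor: int = 1) -> float:
    """Total storage in TB over the retention period, including replicas."""
    daily_tb = users * mb_per_user_per_day / MB_PER_TB   # step 3
    retained_tb = daily_tb * retention_days              # step 5
    return retained_tb * replication_factor              # step 6


# Hypothetical inputs: 2 TB/day, 730 TB retained, 2,190 TB with 3 replicas.
print(storage_requirement_tb(1_000_000, 2, 365, replication_factor=3))  # 2190.0
```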

First Example

Considering:

  1. A social media platform with 10 million daily active users (DAU).
  2. Across all users, the platform generates 1 MB of new data per second.
  3. Replication factor (RF) is 3.
  4. Retention period of 3 years.

Calculate:

10 million → 10,000,000 = 10^7

1 day → 10^5 seconds

1 year → 10^5 * 3.65 * 10^2 = 3.65 * 10^7 seconds

MB → TB = MB / 10^6

GB → TB = GB / 10^3

  1. Storage per second (SPS) = data per second * RF = 1 MB/s * 3 = 3 MB/s
  2. Storage per day (SPD) = SPS * seconds per day = 3 MB/s * 10^5 s = 300,000 MB, or 300 GB
  3. Storage per year (SPY), computed two ways:
    • SPS * seconds per year = 3 MB/s * 3.65 * 10^7 s = 10.95 * 10^7 MB; 10.95 * 10^7 / 10^6 = 10.95 * 10 = 109.5 TB
    • SPD * days per year = 300 GB * 365 = 109,500 GB; 10.95 * 10^4 / 10^3 = 10.95 * 10 = 109.5 TB
  4. Storage per retention period = SPY * retention period = 109.5 TB * 3 = 328.5 TB
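A short sketch that reproduces this calculation, assuming (as the steps above do) that 1 MB/s is the rate fed into the formula and that a day is rounded to 10^5 seconds. The function name and dictionary keys are illustrative.

```python
SECONDS_PER_DAY = 10**5  # rounded, as in the calculation above
DAYS_PER_YEAR = 365


def storage_estimate(mb_per_second: float,
                     replication_factor: int,
                     retention_years: int) -> dict[str, float]:
    """Storage per second, day, year, and retention period (decimal units)."""
    per_second_mb = mb_per_second * replication_factor
    per_day_gb = per_second_mb * SECONDS_PER_DAY / 10**3   # MB -> GB
    per_year_tb = per_day_gb * DAYS_PER_YEAR / 10**3       # GB -> TB
    return {
        "per_second_mb": per_second_mb,
        "per_day_gb": per_day_gb,
        "per_year_tb": per_year_tb,
        "retention_tb": per_year_tb * retention_years,
    }


result = storage_estimate(1, replication_factor=3, retention_years=3)
print(result["per_day_gb"])    # 300.0
print(result["per_year_tb"])   # 109.5
print(result["retention_tb"])  # 328.5
```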
