System Design Checklist

Engineers spend a lot of time searching for answers. However, in system design, the quality of the questions often determines the quality of the solution.

When building a system, the goal is not just to build the system right, but to build the right system. This distinction becomes critical in large-scale distributed systems. Before diving into architecture diagrams, technologies, or algorithms, it is important to clarify the problem space and constraints.

Over time, I collected and curated the following checklist from multiple sources. It has helped me structure my thinking when designing production systems, and it has helped the junior architects I mentor avoid common design blind spots. The checklist is not meant to be rigid. Instead, think of it as a structured guide that keeps the important questions from being overlooked.

Use Cases and Problem Framing

Before discussing architecture, we must understand the problem.

Why / Who

Why does this system need to exist?
What problem does it solve?
Who are the primary users of the system?

Examples of users could include:

End users
Internal services
External partners
Developers using an API platform

Understanding the why and who shapes almost every design decision that follows.

Users and Traffic Volume

Once we understand the users, we need to estimate how the system will be used.

User Distribution

- Are users globally distributed?
- Are there geographic usage patterns?
- Does traffic vary across time zones?

MAU / DAU

- How many Monthly Active Users (MAU) do we expect?
- How many Daily Active Users (DAU) do we expect?

These numbers help estimate system scale.

Peak Traffic

- What are the peak hours for the system?
- Are there seasonal spikes (e.g., sales events)?

Read / Write Patterns

- What is the expected read/write ratio?
- Are there burst workloads or sudden traffic spikes?

Understanding traffic patterns is critical for designing storage, caching, and scaling strategies.

Success Metrics / KPIs

Define what success looks like.

Examples include:

Latency targets
Throughput
Error rates
User engagement
Business metrics (conversion, retention, etc.)

Without clear metrics, it is difficult to determine whether the system is actually delivering value.

Constraints and Trade-offs

Every system operates under constraints.

These could include:

Budget constraints
Latency requirements
Regulatory requirements
Infrastructure limitations
Customer SLA / SLO requirements

Understanding acceptable trade-offs is key. For example:

Consistency vs Availability
Cost vs Performance
Latency vs Throughput

Data Consistency

Clarify the consistency requirements.

Questions to consider:

Do we need strong consistency?
Is eventual consistency acceptable?
Which parts of the system require stricter guarantees?

Not all data needs strong consistency, and relaxing constraints can simplify the system significantly.
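One place this question becomes concrete is in quorum-replicated stores, where each write waits for W of N replicas and each read consults R of them. The sketch below assumes that model; the function name is my own, and it ignores real-world caveats such as sloppy quorums and concurrent writes. It only checks whether every read set is guaranteed to overlap every write set.

```python
def quorum_overlaps(n: int, w: int, r: int) -> bool:
    """Return True when any R replicas must share at least one replica
    with any W replicas, which is the case exactly when R + W > N."""
    if not (1 <= w <= n and 1 <= r <= n):
        raise ValueError("w and r must be between 1 and n")
    return r + w > n


# Three replicas, with different write/read quorum sizes.
for w, r in [(2, 2), (1, 1), (3, 1)]:
    print(f"N=3 W={w} R={r} overlap={quorum_overlaps(3, w, r)}")
# N=3 W=2 R=2 overlap=True
# N=3 W=1 R=1 overlap=False
# N=3 W=3 R=1 overlap=True
```

Settings without overlap are cheaper and faster but can return stale data, which may be perfectly fine for the parts of the system that tolerate eventual consistency.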

Regulatory Requirements

Identify the regulatory requirements.

Examples:

Encryption requirements
Data residency regulations
Compliance standards (GDPR, SOC2, HIPAA, etc.)

Regulatory considerations should be integrated into the design from the beginning.

Back-of-the-Envelope Estimation

Simple estimates help avoid under-designing or over-engineering the system.

Examples:

Requests per second
Storage requirements
Network bandwidth
Cache size

Rough calculations provide intuition about system scale.
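As an illustration, the arithmetic can be captured in a few lines. This is a rough sketch, and every input (10 million DAU, 20 requests per user per day, a 3x peak factor, 10% writes, 1 KB per write, one year of retention) is a made-up assumption rather than a recommendation.

```python
SECONDS_PER_DAY = 86_400


def estimate(dau, requests_per_user_per_day, peak_factor,
             write_fraction, bytes_per_write, retention_days):
    """Return rough average QPS, peak QPS, and storage over the retention window."""
    total_requests = dau * requests_per_user_per_day
    avg_qps = total_requests / SECONDS_PER_DAY
    peak_qps = avg_qps * peak_factor
    writes_per_day = total_requests * write_fraction
    storage_bytes = writes_per_day * bytes_per_write * retention_days
    return {
        "avg_qps": avg_qps,
        "peak_qps": peak_qps,
        "storage_tb": storage_bytes / 1e12,  # decimal terabytes
    }


result = estimate(
    dau=10_000_000, requests_per_user_per_day=20, peak_factor=3,
    write_fraction=0.1, bytes_per_write=1_000, retention_days=365,
)
for name, value in result.items():
    print(f"{name}: {value:,.1f}")
# avg_qps: 2,314.8
# peak_qps: 6,944.4
# storage_tb: 7.3
```

The point is the order of magnitude, not the decimals: a few thousand requests per second and single-digit terabytes per year suggest a very different design than millions of requests per second and petabytes.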

High-Level Design

Once the problem is well understood, we can outline the architecture.

Core System Components

Identify the major components of the system. What existing components can we leverage?

Examples:

API gateway
Application services
Databases
Message queues
Cache layers
Background workers

Architecture Diagram

Create a high-level diagram showing:

System components
Data flow
Interactions between services

The goal is to communicate the overall structure of the system clearly.

Integration Points

Define how different parts of the system interact.

This includes:

Client interfaces
Internal service communication
Third-party integrations

Understanding the system’s surface area helps identify potential bottlenecks and failure points.

APIs and Service Communication

Define service interfaces and communication patterns.

Examples:

REST APIs
gRPC
Event-driven messaging
Webhooks

Consider:

API versioning
Idempotency
Rate limits
Timeout and retry policies
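Idempotency and retry policies work together: a retry is only safe if the server can recognize a repeated request. Here is a minimal sketch of that interaction, assuming exponential backoff with full jitter on the client and a server that remembers results by idempotency key. `TransientError`, `call_with_retries`, and `PaymentServer` are hypothetical names for illustration, not part of any real library.

```python
import random
import time
import uuid


class TransientError(Exception):
    """Stands in for a timeout or a retryable server error."""


def call_with_retries(operation, max_attempts=4, base_delay=0.1,
                      max_delay=2.0, sleep=time.sleep):
    """Call operation(idempotency_key), retrying transient failures.

    The same key is sent on every attempt so the server can deduplicate
    a request that succeeded but whose response was lost.
    """
    idempotency_key = str(uuid.uuid4())
    for attempt in range(1, max_attempts + 1):
        try:
            return operation(idempotency_key)
        except TransientError:
            if attempt == max_attempts:
                raise
            # Exponential backoff capped at max_delay, with full jitter.
            backoff = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, backoff))


class PaymentServer:
    """Hypothetical server that applies each idempotency key at most once."""

    def __init__(self, failures_before_success):
        self.results = {}
        self.charges = 0
        self.failures_left = failures_before_success

    def charge(self, key):
        if key not in self.results:
            self.charges += 1
            self.results[key] = f"charge-{self.charges}"
        if self.failures_left > 0:
            # The work was done, but the response never reached the client.
            self.failures_left -= 1
            raise TransientError("response lost")
        return self.results[key]


server = PaymentServer(failures_before_success=2)
print(call_with_retries(server.charge, sleep=lambda seconds: None))  # charge-1
print(server.charges)  # 1
```

Even though the client tried three times, the customer is charged once. Without the key, each retry would have created a new charge.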

Detailed Design

After the high-level architecture is established, we can drill into the details.

Data Storage

Choose appropriate storage technologies.

Examples:

SQL databases
NoSQL stores
Object storage

Key considerations:

Data models and schemas
Indexing strategies
Query optimization

Also consider:

Replication
Sharding / partitioning
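For sharding, one widely used option is consistent hashing, which limits how many keys move when a shard is added. The sketch below is one possible implementation, not a prescription: `HashRing` is a hypothetical class, and the choice of SHA-256 and 100 virtual nodes per shard are arbitrary assumptions.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string to a 64-bit position on the ring."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class HashRing:
    """Consistent hash ring with virtual nodes."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self._points = []  # sorted list of (position, node) pairs
        for node in nodes:
            self.add(node)

    def add(self, node):
        """Place `vnodes` points for this node on the ring."""
        for i in range(self.vnodes):
            bisect.insort(self._points, (_hash(f"{node}#{i}"), node))

    def lookup(self, key):
        """Return the node owning the first point at or after the key's position."""
        if not self._points:
            raise LookupError("ring is empty")
        index = bisect.bisect_right(self._points, (_hash(key), ""))
        return self._points[index % len(self._points)][1]  # wrap around


ring = HashRing(["shard-a", "shard-b", "shard-c"])
keys = [f"user-{i}" for i in range(1000)]
before = {key: ring.lookup(key) for key in keys}
ring.add("shard-d")
moved = [key for key in keys if ring.lookup(key) != before[key]]
print(f"{len(moved)} of {len(keys)} keys moved, all to the new shard")
```

With plain modulo hashing (`hash(key) % shard_count`), adding a fourth shard would remap most keys. On the ring, a key only moves if one of the new shard's points lands between it and its previous owner.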

Algorithms and Mental Models

Some systems require specific algorithms or data structures.

Examples:

Ranking algorithms
Deduplication strategies
Scheduling algorithms

Always consider alternative approaches and their trade-offs.

Caching Static Data

Caching is essential for performance at scale.

Common caching layers include:

Local in-memory cache
Distributed cache (Redis, Memcached)
CDN caching

Each layer also needs:

An expiration policy (TTLs)
An eviction policy

Key questions:

What data should be cached?
What is the cache invalidation strategy?
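To make expiration, eviction, and invalidation concrete, here is a small local in-memory cache. It is a sketch under stated assumptions: `TTLCache` is a hypothetical class, every entry shares one TTL, eviction is least-recently-used, and it is not thread-safe. The injectable `clock` parameter exists only so the behavior can be demonstrated without sleeping.

```python
import time
from collections import OrderedDict


class TTLCache:
    """In-memory cache with TTL expiration and LRU eviction."""

    def __init__(self, max_entries, ttl_seconds, clock=time.monotonic):
        self.max_entries = max_entries
        self.ttl_seconds = ttl_seconds
        self._clock = clock
        self._entries = OrderedDict()  # key -> (expires_at, value), oldest first

    def get(self, key, default=None):
        entry = self._entries.get(key)
        if entry is None:
            return default
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # expired: drop it and report a miss
            return default
        self._entries.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key, value):
        self._entries[key] = (self._clock() + self.ttl_seconds, value)
        self._entries.move_to_end(key)
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # evict least recently used

    def invalidate(self, key):
        """Explicitly remove a key, e.g. after the underlying record changes."""
        self._entries.pop(key, None)


now = [0.0]  # fake clock so the example runs instantly
cache = TTLCache(max_entries=2, ttl_seconds=10, clock=lambda: now[0])
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")         # touch "a", so "b" becomes least recently used
cache.put("c", 3)      # over capacity: evicts "b"
print(cache.get("b"))  # None
now[0] = 11            # advance past the TTL
print(cache.get("a"))  # None
```

The TTL bounds how stale an entry can get, while `invalidate` covers the cases where waiting for expiry is not acceptable.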

Fault Tolerance

Distributed systems fail in unpredictable ways.

Consider:

Service failures
Network partitions
Partial outages

Mitigation strategies include:

Retries
Circuit breakers
Graceful degradation
Redundancy
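Of these strategies, the circuit breaker is the one that benefits most from seeing its state machine written out. The following is a simplified sketch: `CircuitBreaker` and `CircuitOpenError` are hypothetical names, the thresholds are arbitrary, it counts consecutive failures only, and it is not thread-safe.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    """Opens after consecutive failures, then allows a trial call after a timeout."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open, let this one call through as a trial.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            half_open_failed = self._opened_at is not None
            if half_open_failed or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        self._failures = 0
        self._opened_at = None  # success closes the circuit
        return result


def flaky():
    raise ConnectionError("dependency down")


now = [0.0]  # fake clock so the example runs instantly
breaker = CircuitBreaker(failure_threshold=2, reset_timeout=5, clock=lambda: now[0])
for _ in range(3):
    try:
        breaker.call(flaky)
    except ConnectionError:
        print("call failed")
    except CircuitOpenError:
        print("rejected; serve a fallback instead")  # graceful degradation
# call failed
# call failed
# rejected; serve a fallback instead
```

Failing fast while the circuit is open protects the struggling dependency from more load and gives the caller a clear signal to degrade gracefully.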

Non-Functional Characteristics

A well-designed system typically aims for the following qualities:

Scalability
Availability
Reliability
Maintainability

Balancing these characteristics while meeting business goals is the essence of system design.

Scalability

Design the system to handle growth.

Important considerations:

Horizontal scaling
Load balancing
Partitioning and sharding
Capacity planning

Also think about how the system behaves during sudden traffic spikes.

Trade-offs

No system is perfect.

Every design involves trade-offs such as:

Cost vs Performance
Consistency vs Availability
Latency vs Throughput
Simplicity vs Flexibility

Explicitly documenting these trade-offs makes the design easier to reason about.

Security and Compliance

Security must be integrated throughout the system.

Key elements include:

Authentication (AuthN)
Authorization (AuthZ)
Rate limiting
Input validation
Encryption

Security should never be an afterthought.

Infrastructure & Operations

Deployment Options

Where will the system run? Will it be a separate microservice or part of an existing service?

Options include:

Cloud infrastructure
On-premise deployments
Hybrid environments

Consider scalability, operational complexity, and vendor lock-in.

Cost Optimization

Large systems can incur significant operational costs.

Opportunities for optimization include:

Autoscaling
Storage tiering
Efficient resource utilization
Automated lifecycle management

Backpressure

What happens when the system is overloaded?
Are requests queued, throttled, or dropped?
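One simple answer to both questions is a bounded queue: requests are queued while there is room and rejected once the queue is full. This is a minimal sketch of that idea; `BoundedIntake` is a hypothetical class, and the capacity of 3 is arbitrary.

```python
import queue


class BoundedIntake:
    """Accepts work until a fixed-size queue is full, then sheds load."""

    def __init__(self, capacity):
        self._queue = queue.Queue(maxsize=capacity)
        self.rejected = 0

    def submit(self, request):
        """Return True if queued, False if rejected because the queue is full."""
        try:
            self._queue.put_nowait(request)
            return True
        except queue.Full:
            self.rejected += 1  # caller can respond with "try again later"
            return False

    def drain(self, max_items):
        """Remove and return up to max_items queued requests, oldest first."""
        items = []
        while len(items) < max_items:
            try:
                items.append(self._queue.get_nowait())
            except queue.Empty:
                break
        return items


intake = BoundedIntake(capacity=3)
accepted = [intake.submit(f"req-{i}") for i in range(5)]
print(accepted)         # [True, True, True, False, False]
print(intake.rejected)  # 2
print(intake.drain(2))  # ['req-0', 'req-1']
```

Rejecting early keeps latency bounded for the requests that are accepted. An unbounded queue hides the overload until memory runs out or every request times out.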

Disaster Recovery

What happens during regional outages?
What are the RTO / RPO requirements?

Deployment Strategy

Blue / Green deployments
Canary releases
Rollback strategy
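For canary releases, a common approach is to route a fixed percentage of users to the new version based on a hash of their ID, so that each user consistently sees the same version. A small sketch, assuming string user IDs; `routes_to_canary` is a hypothetical helper, and real traffic splitting usually lives in the load balancer or service mesh rather than in application code.

```python
import hashlib


def routes_to_canary(user_id: str, canary_percent: int) -> bool:
    """Deterministically assign a user to one of 100 buckets and
    send the lowest `canary_percent` buckets to the canary."""
    digest = hashlib.sha256(user_id.encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < canary_percent


users = [f"user-{i}" for i in range(10_000)]
for percent in (0, 5, 20, 100):
    share = sum(routes_to_canary(user, percent) for user in users) / len(users)
    print(f"canary at {percent}%: {share:.1%} of users routed")
```

Because the bucket depends only on the user ID, raising the percentage keeps existing canary users on the canary, and rolling back is as simple as setting the percentage to 0.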

Testing

Load testing
Chaos testing
Failure injection

Data Lifecycle

Data retention policies
Archival and cold storage
GDPR deletion requirements

Observability (M,A,L,T,E)

A production system must be observable.

MALTE is a useful mnemonic:

Metrics
Alerts
Logs
Traces
Events

These signals help diagnose issues and understand system behavior.

Additional practices include:

Dashboards for system health
Debugging tools
Distributed tracing

Platform Considerations

Developer Experience and Onboarding

If the system is a platform or API, developer experience becomes important.

Providing:

SDKs
Client libraries
Good documentation

can significantly improve adoption and reduce integration friction.
