As systems scale to serve millions or even billions of users, a single-server architecture no longer suffices. That's where scalable infrastructure and distributed systems come into play, and it's why top tech companies routinely test your understanding of these topics in senior-level system design interviews.
This guide walks through the key distributed systems patterns and how they apply to real-world design problems.
🧱 Core Concepts of Scalable Infrastructure
🔹 1. Sharding and Partitioning
Sharding = splitting large datasets across multiple nodes to scale reads and writes.
Types of sharding:
- Hash-based (e.g., hash(user_id) % N)
- Range-based (e.g., user_id 0–1000 → shard A)
- Geo-based (e.g., users in US vs. EU)
Challenges:
- Hotspots & unbalanced traffic
- Rebalancing and resharding
- Cross-shard queries and joins
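To make the resharding challenge concrete, here is a minimal sketch that compares the hash(user_id) % N scheme with a consistent-hash ring when a fifth shard is added. The names `stable_hash`, `modulo_shard`, `HashRing`, and `fraction_moved` are hypothetical and chosen for illustration; the ring is a common mitigation for resharding cost, not the only one.

```python
import bisect
import hashlib


def stable_hash(key: str) -> int:
    """Hash that is stable across processes (built-in hash() is salted for str)."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


def modulo_shard(user_id: int, num_shards: int) -> int:
    """Hash-based sharding: hash(user_id) % N."""
    return stable_hash(str(user_id)) % num_shards


class HashRing:
    """Toy consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, shard_names, vnodes=100):
        # Each shard owns many points on the ring to even out the load.
        self._points = sorted(
            (stable_hash(f"{name}#{i}"), name)
            for name in shard_names
            for i in range(vnodes)
        )
        self._hashes = [h for h, _ in self._points]

    def shard_for(self, user_id: int) -> str:
        # Walk clockwise to the first point after the key, wrapping around.
        idx = bisect.bisect_right(self._hashes, stable_hash(str(user_id)))
        return self._points[idx % len(self._points)][1]


def fraction_moved(before, after, user_ids) -> float:
    """Share of keys whose shard changes between two placement functions."""
    moved = sum(1 for u in user_ids if before(u) != after(u))
    return moved / len(user_ids)


user_ids = range(10_000)

# Growing from 4 to 5 shards with plain modulo placement.
modulo_moved = fraction_moved(
    lambda u: modulo_shard(u, 4), lambda u: modulo_shard(u, 5), user_ids
)

# The same growth with a consistent-hash ring.
ring4 = HashRing(["A", "B", "C", "D"])
ring5 = HashRing(["A", "B", "C", "D", "E"])
ring_moved = fraction_moved(ring4.shard_for, ring5.shard_for, user_ids)

print(f"modulo: {modulo_moved:.0%} of keys moved, ring: {ring_moved:.0%} moved")
```

With uniform hashes, modulo placement is expected to move about four in five keys when N goes from 4 to 5, while the ring moves roughly the share taken over by the new shard. That difference is worth mentioning when an interviewer asks how you would add capacity later.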
📌 Common Interview Example: “How would you shard a user database for 1 billion users?”
🔹 2. Replication
Replication = keeping copies of the same data on multiple nodes to provide fault tolerance and improve read throughput.
Types of replication:
- Master-Slave (Leader-Follower): writes go to leader, reads from followers
- Multi-Master: all nodes can write (conflict resolution needed)
- Leaderless: any replica can accept writes; used in Dynamo-style systems such as Cassandra and Riak
Trade-offs:
- Data consistency vs. availability
- Lag in replicas (eventual consistency)
- Conflict handling (last write wins, vector clocks)
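The replica-lag trade-off can be seen in a few lines. The sketch below is a single-process toy, assuming a leader that appends writes to a log and ships them to followers in a separate step; `Leader`, `Follower`, and `replicate` are hypothetical names, and a real system would ship the log over the network asynchronously.

```python
class Follower:
    """Read replica that applies the leader's log in order."""

    def __init__(self):
        self.data = {}
        self.applied = 0  # number of log entries applied so far

    def read(self, key):
        return self.data.get(key)


class Leader:
    """Accepts all writes and records them in an ordered log."""

    def __init__(self, followers):
        self.data = {}
        self.log = []  # ordered (key, value) writes
        self.followers = followers

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))

    def replicate(self):
        """Ship entries each follower has not applied yet."""
        for follower in self.followers:
            for key, value in self.log[follower.applied:]:
                follower.data[key] = value
            follower.applied = len(self.log)


follower = Follower()
leader = Leader([follower])

leader.write("msg:1", "hello")
stale = follower.read("msg:1")  # follower has not caught up yet

leader.replicate()
fresh = follower.read("msg:1")  # follower now sees the write
```

Between `write` and `replicate`, a read from the follower misses the new message. If the leader crashed in that window, the write would exist only on the failed node, which is why designs that must not lose messages usually wait for at least one follower to acknowledge before confirming the write.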
📌 Common Interview Example: “Design a messaging app that doesn't lose messages if a node crashes.”
🔹 3. Distributed Queues
Message queues decouple services and help with asynchronous processing.
Popular tools: Kafka, RabbitMQ, AWS SQS
Design considerations:
- At-least-once vs. exactly-once semantics
- Message ordering and partitioning
- Dead-letter queues and retries
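The considerations above fit together roughly as follows. This is a minimal in-memory sketch, not how Kafka, RabbitMQ, or SQS work internally: `process_queue`, `send_email`, and `MAX_ATTEMPTS` are hypothetical names, and the message shape is an assumption. It shows at-least-once delivery with retries, a dead-letter list for messages that keep failing, and an idempotent handler that tolerates duplicate deliveries.

```python
from collections import deque

MAX_ATTEMPTS = 3  # assumed retry budget before dead-lettering


def process_queue(messages, handler, max_attempts=MAX_ATTEMPTS):
    """Deliver each message until the handler succeeds or attempts run out."""
    queue = deque((msg, 1) for msg in messages)
    dead_letters = []
    while queue:
        msg, attempt = queue.popleft()
        try:
            handler(msg)
        except Exception:
            if attempt >= max_attempts:
                dead_letters.append(msg)  # give up; park for inspection
            else:
                queue.append((msg, attempt + 1))  # redeliver later
    return dead_letters


processed_ids = set()  # lets the handler ignore duplicate deliveries
failed_once = set()    # used only to simulate one transient failure
sent = []


def send_email(msg):
    if msg["id"] in processed_ids:
        return  # duplicate delivery: already handled
    if msg["to"] is None:
        raise ValueError("no recipient")  # permanent failure
    if msg["id"] == 2 and msg["id"] not in failed_once:
        failed_once.add(msg["id"])
        raise TimeoutError("transient failure")  # succeeds on retry
    sent.append(msg["to"])
    processed_ids.add(msg["id"])


messages = [
    {"id": 1, "to": "a@example.com"},
    {"id": 2, "to": "b@example.com"},
    {"id": 1, "to": "a@example.com"},  # duplicate of the first message
    {"id": 3, "to": None},             # will never succeed
]

dead = process_queue(messages, send_email)
```

The retried message is appended to the back of the queue, so it is processed out of its original order. That is the usual tension between retries and ordering, and it is one reason ordering guarantees are typically scoped to a partition or key.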
📌 Common Use Cases: Email delivery, video processing, log ingestion, metrics pipelines
🔹 4. Consensus & Coordination
In distributed systems, you need coordination to:
- Elect leaders
- Agree on configuration changes
- Maintain consistency across replicas
Common algorithms and tools:
- Raft: simpler to reason about than Paxos
- Paxos: foundational but complex
- ZooKeeper / etcd: production-ready coordination services (built on ZAB and Raft, respectively)
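In practice, services rarely implement Raft or Paxos themselves; they usually lean on a coordination service and compete for a lease. The sketch below illustrates that pattern with an in-memory stand-in. `LeaseStore` and its methods are hypothetical and are not the ZooKeeper or etcd client API; the clock is passed in explicitly to keep the example deterministic.

```python
import threading


class LeaseStore:
    """In-memory stand-in for a coordination service (illustrative only)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._holder = None
        self._expires_at = 0.0

    def try_acquire(self, node_id, now, ttl):
        """Take or renew the lease if it is free, expired, or already ours."""
        with self._lock:
            free = self._holder is None or now >= self._expires_at
            if free or self._holder == node_id:
                self._holder = node_id
                self._expires_at = now + ttl
                return True
            return False

    def leader(self, now):
        """Current lease holder, or None if the lease has expired."""
        with self._lock:
            return self._holder if now < self._expires_at else None


store = LeaseStore()

a_wins = store.try_acquire("svc-a", now=0.0, ttl=10.0)           # lease is free
b_wins = store.try_acquire("svc-b", now=1.0, ttl=10.0)           # held by svc-a
a_renews = store.try_acquire("svc-a", now=5.0, ttl=10.0)         # renewal
b_takes_over = store.try_acquire("svc-b", now=20.0, ttl=10.0)    # svc-a expired
```

The toy store is a single process, so it is itself a single point of failure. The consensus algorithm is what lets a real coordination service keep this lease state consistent across several nodes even when some of them fail.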
📌 Example: “How do microservices agree on a leader for writing to a shared DB?”
🔹 5. Eventual Consistency & Conflict Resolution
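In high-availability systems, eventual consistency is often preferred over strong consistency, because replicas can keep accepting reads and writes during a network partition and reconcile their differences afterward.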
Techniques:
- Vector clocks
- CRDTs (Conflict-Free Replicated Data Types)
- Read repair & hinted handoff (used in Cassandra)
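Two of these techniques are small enough to sketch. The first function compares vector clocks to decide whether one update happened before another or whether the two are concurrent (a real conflict). The class is a grow-only counter, one of the simplest CRDTs. All names (`compare`, `GCounter`, the device labels) are hypothetical; this is an illustration of the ideas, not code from any particular database.

```python
def compare(a: dict, b: dict) -> str:
    """Relate two vector clocks: 'before', 'after', 'equal', or 'concurrent'."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"  # neither dominates: the updates conflict


class GCounter:
    """Grow-only counter CRDT: one slot per node, merged by element-wise max."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}

    def increment(self, amount=1):
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def merge(self, other):
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

    def value(self):
        return sum(self.counts.values())


# A file edited twice on the laptop: the second version supersedes the first.
ordered = compare({"laptop": 1}, {"laptop": 2})

# Laptop and phone both edited version 1 independently: a true conflict.
conflict = compare({"laptop": 2}, {"laptop": 1, "phone": 1})

# Two replicas count independently, then exchange state in either order.
replica_a, replica_b = GCounter("a"), GCounter("b")
replica_a.increment(2)
replica_b.increment(3)
replica_a.merge(replica_b)
replica_b.merge(replica_a)
replica_a.merge(replica_b)  # merging again changes nothing (idempotent)
```

A vector clock only detects the conflict; the application still has to resolve it, for example by keeping both file versions. A CRDT avoids the question by construction, because its merge gives the same result regardless of the order in which replicas exchange state.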
📌 Example: “Design Dropbox. How would you resolve conflicting file updates?”
🛠️ Supporting Infrastructure
Load Balancing
- DNS round robin, L4 (TCP), or L7 (HTTP) load balancing
- Global load balancers (e.g., for multi-region deployments)
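As a rough illustration of the round-robin idea, the sketch below rotates through a fixed list of backends and skips any that have been marked unhealthy. `RoundRobinBalancer` and the addresses are hypothetical, and the health marking is an added assumption; real load balancers layer health checks, weights, and connection tracking on top of this.

```python
import itertools


class RoundRobinBalancer:
    """Rotate through backends, skipping ones marked down (illustrative only)."""

    def __init__(self, backends):
        self._backends = list(backends)
        self._healthy = set(self._backends)
        self._cycle = itertools.cycle(self._backends)

    def mark_down(self, backend):
        self._healthy.discard(backend)

    def mark_up(self, backend):
        if backend in self._backends:
            self._healthy.add(backend)

    def next_backend(self):
        # Try each backend at most once per call before giving up.
        for _ in range(len(self._backends)):
            candidate = next(self._cycle)
            if candidate in self._healthy:
                return candidate
        raise RuntimeError("no healthy backends")


lb = RoundRobinBalancer(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
first_four = [lb.next_backend() for _ in range(4)]

lb.mark_down("10.0.0.2")
after_failure = [lb.next_backend() for _ in range(3)]
```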
Service Discovery
- Helps services locate each other dynamically
- Tools: Consul, Eureka, Kubernetes DNS
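The core of dynamic discovery can be sketched as a registry where instances send heartbeats and lookups return only instances seen recently. `ServiceRegistry`, the service name, and the addresses are hypothetical; this is not the API of Consul, Eureka, or Kubernetes, and the clock is passed in explicitly to keep the example deterministic.

```python
class ServiceRegistry:
    """Toy registry: instances expire if they stop sending heartbeats."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._instances = {}  # service name -> {address: last heartbeat time}

    def heartbeat(self, service, address, now):
        """Register an instance or refresh its last-seen time."""
        self._instances.setdefault(service, {})[address] = now

    def lookup(self, service, now):
        """Addresses whose last heartbeat is newer than the TTL."""
        entries = self._instances.get(service, {})
        return sorted(
            addr for addr, seen in entries.items() if now - seen < self.ttl
        )


registry = ServiceRegistry(ttl_seconds=30)
registry.heartbeat("orders", "10.0.1.5:8080", now=0)
registry.heartbeat("orders", "10.0.1.6:8080", now=20)

both_alive = registry.lookup("orders", now=25)
one_expired = registry.lookup("orders", now=40)  # first instance went silent
```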
Monitoring & Logging
- Logging: ELK Stack, Fluentd
- Metrics: Prometheus, Grafana
- Tracing: Jaeger, OpenTelemetry
🧠 How These Appear in Interviews
Topic | Sample Interview Question |
---|---|
Sharding | “Design a scalable database for user profiles” |
Replication | “Ensure high availability in a payment system” |
Queues | “Process millions of image uploads in near-real time” |
Consensus | “Design a distributed lock system for microservices” |
Eventual Consistency | “Design Amazon’s shopping cart” |
⚖️ Trade-Offs to Always Consider
Dimension | Trade-off Options |
---|---|
Consistency | Strong vs. eventual |
Latency | Synchronous vs. asynchronous processing |
Availability | Single region vs. multi-region |
Durability | In-memory vs. persistent storage |
Fault tolerance | Single AZ vs. cross-region replication |
✅ Summary Table
Concept | Why It Matters |
---|---|
Sharding | Enables horizontal scaling of storage |
Replication | Improves fault tolerance and read speed |
Queues | Decouple services and enable parallelism |
Consensus | Enables coordination and system reliability |
Eventual Consistency | Increases availability in distributed systems |