Scalable Infrastructure & Distributed Systems: Core Concepts for System Design Interviews

As systems scale to serve millions or even billions of users, a single-server architecture no longer suffices. That is where scalable infrastructure and distributed systems come into play, and it is why top tech companies routinely test your understanding of these topics in senior-level system design interviews.

This guide walks through the key distributed systems patterns and how they apply to real-world design problems.

🧱 Core Concepts of Scalable Infrastructure

🔹 1. Sharding and Partitioning

Sharding = splitting large datasets across multiple nodes to scale reads and writes.

Types of sharding:

Hash-based (e.g., hash(user_id) % N)
Range-based (e.g., user_id 0–1000 → shard A)
Geo-based (e.g., users in US vs. EU)
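
The three strategies above can be sketched as small routing functions. This is a minimal illustration, not a production router: the shard count, the shard names, and the range boundaries beyond "0–1000 → shard A" are assumptions made up for the example.

```python
import hashlib
from bisect import bisect_right

NUM_SHARDS = 4  # assumed shard count, for illustration only


def hash_shard(user_id: int, num_shards: int = NUM_SHARDS) -> int:
    """Hash-based: a stable hash of the key, modulo the shard count."""
    digest = hashlib.sha256(str(user_id).encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Range-based: each shard owns a contiguous range of keys.
# RANGE_STARTS[i] is the lowest user_id owned by RANGE_NAMES[i] (hypothetical layout).
RANGE_STARTS = [0, 1001, 2001]
RANGE_NAMES = ["shard-A", "shard-B", "shard-C"]


def range_shard(user_id: int) -> str:
    if user_id < RANGE_STARTS[0]:
        raise ValueError("user_id is below the first range")
    return RANGE_NAMES[bisect_right(RANGE_STARTS, user_id) - 1]


# Geo-based: a static mapping from region to shard (hypothetical names).
GEO_SHARDS = {"US": "shard-us", "EU": "shard-eu"}


def geo_shard(region: str) -> str:
    return GEO_SHARDS[region]
```

A stable hash such as SHA-256 is used instead of Python's built-in hash(), which is randomized per process for strings and would route the same key differently across restarts.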

Challenges:

Hotspots & unbalanced traffic
Rebalancing and resharding
Cross-shard queries and joins
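
One common way to soften the resharding problem is consistent hashing, which this guide does not name explicitly but which often comes up as a follow-up. The sketch below compares how many keys change owner when a fifth node is added, under plain modulo hashing versus a hash ring with virtual nodes. The class name, node names, and vnode count are assumptions for illustration.

```python
import hashlib
from bisect import bisect_right


def _h(key: str) -> int:
    """Stable 64-bit hash of a string."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    """Consistent-hash ring; each node is placed at `vnodes` points on the ring."""

    def __init__(self, nodes, vnodes: int = 100):
        self._ring = sorted((_h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
        self._points = [point for point, _ in self._ring]

    def node_for(self, key: str) -> str:
        # The owner is the first ring point clockwise from the key's hash.
        idx = bisect_right(self._points, _h(key)) % len(self._ring)
        return self._ring[idx][1]


keys = [f"user-{i}" for i in range(10_000)]
before = HashRing(["a", "b", "c", "d"])
after = HashRing(["a", "b", "c", "d", "e"])

# Keys whose owner changes when node "e" joins the ring.
moved_ring = sum(before.node_for(k) != after.node_for(k) for k in keys)
# Keys whose shard changes when going from `% 4` to `% 5`.
moved_mod = sum(_h(k) % 4 != _h(k) % 5 for k in keys)

print(f"ring: {moved_ring} keys moved, modulo: {moved_mod} keys moved")
```

In expectation the ring moves about one fifth of the keys (only those taken over by the new node), while modulo hashing moves about four fifths, which is why changing N in hash(user_id) % N is so disruptive.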

📌 Common Interview Example: “How would you shard a user database for 1 billion users?”

🔹 2. Replication

Replication provides fault tolerance and improves read throughput.

Types of replication:

Master-Slave (Leader-Follower): writes go to leader, reads from followers
Multi-Master: all nodes can write (conflict resolution needed)
Leaderless: any replica can accept reads and writes, typically with quorums (used in Dynamo-style systems such as Cassandra and Riak)

Trade-offs:

Data consistency vs. availability
Replication lag (followers may serve stale reads until they catch up)
Conflict handling (last write wins, vector clocks)
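
The two conflict-handling strategies named above behave differently when writes are concurrent. The sketch below is a minimal illustration with hypothetical function names: last write wins simply keeps the newer timestamp, while a vector clock comparison can tell that neither write happened before the other, so the application can merge them instead of silently dropping one.

```python
from typing import Dict, Tuple

VectorClock = Dict[str, int]  # node id -> number of events seen from that node


def last_write_wins(a: Tuple[float, str], b: Tuple[float, str]) -> Tuple[float, str]:
    """Each write is (timestamp, value); the later timestamp wins, the other is discarded."""
    return a if a[0] >= b[0] else b


def compare(a: VectorClock, b: VectorClock) -> str:
    """Return 'before', 'after', 'equal', or 'concurrent' for clock a relative to b."""
    nodes = a.keys() | b.keys()
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"  # neither dominates: a genuine conflict


def merge(a: VectorClock, b: VectorClock) -> VectorClock:
    """Element-wise maximum, used after a conflict has been resolved."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in a.keys() | b.keys()}
```

For example, compare({"n1": 1}, {"n2": 1}) reports a concurrent pair, whereas last write wins would pick one of the two writes based only on wall-clock time, which is sensitive to clock skew.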

📌 Common Interview Example: “Design a messaging app that doesn't lose messages if a node crashes.”

🔹 3. Distributed Queues

Message queues decouple services and help with asynchronous processing.

Popular tools: Kafka, RabbitMQ, AWS SQS

Design considerations:

At-most-once vs. at-least-once vs. exactly-once delivery semantics
Message ordering and partitioning
Dead-letter queues and retries
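
To make these considerations concrete, here is a minimal in-memory sketch of at-least-once processing with retries, a dead-letter queue, and an idempotent consumer. It does not use Kafka, RabbitMQ, or SQS; the Message type, the drain function, and the retry limit are assumptions for illustration.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, List, Set

MAX_ATTEMPTS = 3  # assumed retry limit before a message is dead-lettered


@dataclass
class Message:
    id: str
    body: str
    attempts: int = 0


def drain(
    queue: Deque[Message],
    handler: Callable[[Message], None],
    dead_letters: List[Message],
    processed_ids: Set[str],
) -> None:
    """Process every message at least once; park repeated failures in the DLQ."""
    while queue:
        msg = queue.popleft()
        if msg.id in processed_ids:
            continue  # duplicate delivery: skipping it keeps the consumer idempotent
        msg.attempts += 1
        try:
            handler(msg)
            processed_ids.add(msg.id)
        except Exception:
            if msg.attempts >= MAX_ATTEMPTS:
                dead_letters.append(msg)  # give up and keep it for inspection
            else:
                queue.append(msg)  # retry later, behind the other messages


handled: List[str] = []


def handler(msg: Message) -> None:
    if msg.body == "bad":
        raise ValueError("cannot process message")
    handled.append(msg.body)


queue = deque([Message("1", "ok"), Message("2", "bad"), Message("1", "ok")])
dead_letters: List[Message] = []
processed: Set[str] = set()
drain(queue, handler, dead_letters, processed)
```

After the run, the duplicate of message 1 has been handled only once, and message 2 sits in the dead-letter list after three failed attempts. Deduplicating by message id on top of at-least-once delivery is a common way to approximate exactly-once effects.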

📌 Common Use Cases: Email delivery, video processing, log ingestion, metrics pipelines

🔹 4. Consensus & Coordination

In distributed systems, you need coordination to:

Elect leaders
Agree on configuration changes
Maintain consistency across replicas

Common algorithms and tools:

Raft: simpler to reason about than Paxos
Paxos: foundational but complex
ZooKeeper / etcd: production-ready coordination services (built on ZAB and Raft, respectively)
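
The core voting rule behind Raft-style leader election can be sketched in a few lines: each node grants at most one vote per term, and a candidate becomes leader only with a strict majority. This is a heavily simplified illustration, not full Raft: it omits the log up-to-date check, election timeouts, heartbeats, and networking, and the Node class and run_election function are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    current_term: int = 0
    voted_for: Optional[str] = None

    def request_vote(self, term: int, candidate: str) -> bool:
        """Grant the vote if the term is current and no other candidate has it."""
        if term < self.current_term:
            return False  # stale candidate
        if term > self.current_term:
            # A newer term starts with a fresh vote.
            self.current_term = term
            self.voted_for = None
        if self.voted_for in (None, candidate):
            self.voted_for = candidate
            return True
        return False  # already voted for someone else in this term


def run_election(candidate: Node, cluster: List[Node]) -> bool:
    """The candidate (a member of cluster) asks every node, itself included, for a vote."""
    term = candidate.current_term + 1
    votes = sum(node.request_vote(term, candidate.name) for node in cluster)
    return votes > len(cluster) // 2  # strict majority required


cluster = [Node("a"), Node("b"), Node("c")]
a_is_leader = run_election(cluster[0], cluster)
```

Because each node votes once per term and a majority is required, two candidates cannot both win the same term. In practice most teams delegate this to ZooKeeper or etcd rather than implementing it themselves.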

📌 Example: “How do microservices agree on a leader for writing to a shared DB?”

🔹 5. Eventual Consistency & Conflict Resolution

In high-availability systems, eventual consistency is often preferred over strong consistency when briefly stale reads are acceptable.

Techniques:

Vector clocks
CRDTs (Conflict-Free Replicated Data Types)
Read repair & hinted handoff (used in Cassandra)
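
As an illustration of the CRDT idea, here is a sketch of a grow-only counter (G-Counter), one of the simplest CRDTs. Each replica increments only its own slot, and merging takes the per-replica maximum, so replicas converge regardless of the order or number of merges. The class and replica names are assumptions for the example.

```python
from typing import Dict


class GCounter:
    """Grow-only counter CRDT: one monotonically increasing slot per replica."""

    def __init__(self, replica_id: str):
        self.replica_id = replica_id
        self.counts: Dict[str, int] = {}

    def increment(self, amount: int = 1) -> None:
        if amount < 0:
            raise ValueError("a G-Counter can only grow")
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def value(self) -> int:
        return sum(self.counts.values())

    def merge(self, other: "GCounter") -> None:
        # Per-replica max is commutative, associative, and idempotent.
        for replica, count in other.counts.items():
            self.counts[replica] = max(self.counts.get(replica, 0), count)


a, b = GCounter("replica-a"), GCounter("replica-b")
a.increment(2)
b.increment(3)
a.merge(b)
b.merge(a)
a.merge(b)  # merging again changes nothing
```

Both replicas end at a value of 5 without any coordination. Richer structures such as shopping carts or collaborative documents need more elaborate CRDTs, but they rely on the same merge properties.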

📌 Example: “Design Dropbox. How would you resolve conflicting file updates?”

🛠️ Supporting Infrastructure

Load Balancing

DNS round robin, L4 (TCP), or L7 (HTTP) load balancing
Global load balancers (e.g., for multi-region deployments)
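
Whatever layer the balancer runs at, the simplest selection policy is round robin with health awareness. The sketch below is a toy in-process version under that assumption; the class and method names are hypothetical, and real load balancers add health probes, weights, and connection tracking.

```python
from itertools import cycle
from typing import Iterable


class RoundRobinBalancer:
    """Rotate through backends in order, skipping any marked unhealthy."""

    def __init__(self, backends: Iterable[str]):
        self._backends = list(backends)
        self._healthy = set(self._backends)
        self._cycle = cycle(self._backends)

    def mark_down(self, backend: str) -> None:
        self._healthy.discard(backend)

    def mark_up(self, backend: str) -> None:
        if backend in self._backends:
            self._healthy.add(backend)

    def next_backend(self) -> str:
        if not self._healthy:
            raise RuntimeError("no healthy backends available")
        # One full rotation is guaranteed to reach a healthy backend.
        for _ in range(len(self._backends)):
            backend = next(self._cycle)
            if backend in self._healthy:
                return backend
        raise RuntimeError("no healthy backends available")


lb = RoundRobinBalancer(["a", "b", "c"])
first_four = [lb.next_backend() for _ in range(4)]   # a, b, c, a
lb.mark_down("b")
after_failure = [lb.next_backend() for _ in range(3)]  # c, a, c
```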

Service Discovery

Helps services locate each other dynamically
Tools: Consul, Eureka, Kubernetes DNS
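
The basic mechanism behind dynamic discovery can be sketched as a registry where instances send heartbeats and entries expire after a time-to-live. This is a toy in-memory model, not the API of Consul, Eureka, or Kubernetes; the class name, the TTL value, and the injectable clock are assumptions for illustration.

```python
import time
from typing import Callable, Dict, List, Tuple


class ServiceRegistry:
    """Instances register via heartbeats; entries older than the TTL are ignored."""

    def __init__(self, ttl_seconds: float = 10.0, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._last_seen: Dict[Tuple[str, str], float] = {}

    def heartbeat(self, service: str, address: str) -> None:
        self._last_seen[(service, address)] = self._clock()

    def lookup(self, service: str) -> List[str]:
        now = self._clock()
        return sorted(
            address
            for (name, address), seen in self._last_seen.items()
            if name == service and now - seen <= self._ttl
        )


now = [0.0]  # fake clock for the demonstration
registry = ServiceRegistry(ttl_seconds=10.0, clock=lambda: now[0])
registry.heartbeat("orders", "10.0.0.1:8080")
now[0] = 5.0
registry.heartbeat("orders", "10.0.0.2:8080")
both_alive = registry.lookup("orders")
now[0] = 12.0  # the first instance has missed its heartbeat window
one_alive = registry.lookup("orders")
```

An instance that crashes simply stops sending heartbeats and drops out of lookups once its TTL lapses, so callers stop routing to it without any explicit deregistration.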

Monitoring & Logging

Logging: ELK Stack, Fluentd
Metrics: Prometheus, Grafana
Tracing: Jaeger, OpenTelemetry

🧠 How These Appear in Interviews

Topic	Sample Interview Question
Sharding	“Design a scalable database for user profiles”
Replication	“Ensure high availability in a payment system”
Queues	“Process millions of image uploads in near-real time”
Consensus	“Design a distributed lock system for microservices”
Eventual Consistency	“Design Amazon’s shopping cart”

⚖️ Trade-Offs to Always Consider

Dimension	Trade-off Options
Consistency	Strong vs. eventual
Latency	Synchronous vs. asynchronous processing
Availability	Single region vs. multi-region
Durability	In-memory vs. persistent storage
Fault tolerance	Single AZ vs. cross-region replication

✅ Summary Table

Concept	Why It Matters
Sharding	Enables horizontal scaling of storage
Replication	Improves fault tolerance and read speed
Queues	Decouple services and enable parallelism
Consensus	Enables coordination and system reliability
Eventual Consistency	Increases availability in distributed systems